Abstract Book (Responses)
Abstract Submission Number | Abstract Title | Authors | Abstract Text
18 | Automated detection of Malaria parasites using smartphones | Oluwatoyin Sanni, Andela and Farouq Oyebiyi, Criterion Analytics
Malaria has long been a problem in many areas of the world; nearly half of the world's population is at risk. In 2015, there were roughly 212 million malaria cases and an estimated 429,000 malaria deaths. Malaria is one of the top causes of death in sub-Saharan Africa and, according to UNICEF, it is the leading killer of children. Around 90% of malaria cases occurred in Africa, where the lack of access to malaria diagnosis is largely due to a shortage of experts, with the shortage of equipment a secondary factor. Microscopic analysis of blood samples has been the conventional and preferred method of malaria diagnosis, largely due to its effectiveness and accuracy. The importance of developing new tools that facilitate rapid and easy diagnosis of malaria in areas with limited access to healthcare services therefore cannot be overstated.
Malaria can be detected in various ways: antigen detection, molecular diagnosis, serology, drug resistance tests, and, most commonly, microscopic diagnosis. This research presents a blood image processing and analysis methodology that uses a convolutional neural network to detect the presence of P. falciparum trophozoites, their stage, and the percentage of white blood cells in each sample. The main differentiating factor is the exclusive use of images acquired with low-cost and accessible tools such as smartphones. The microscopic image captured by a smartphone is cleaned of unwanted features, such as noise, through several image processing algorithms: the image is converted to grayscale, a median filter with a large window is applied to eliminate the inner structures, and Otsu’s method is then applied. The remaining inner structures inside the optical circle are removed using a flood fill algorithm.
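The clean-up steps described above (grayscale conversion, median filtering, Otsu thresholding, flood fill) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the nested-list image representation, the window radius, and the 4-connectivity of the flood fill are all assumptions made for the example, and a real pipeline would use an image processing library.

```python
from collections import deque
from statistics import median


def to_gray(rgb):
    """Convert rows of (r, g, b) tuples to 0-255 luminance integers."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]


def median_filter(img, radius=1):
    """Replace each pixel by the median of its window, clipped at the borders."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = int(median(window))
    return out


def otsu_threshold(img):
    """Return the threshold t that maximises the between-class variance
    of the two groups 'pixels <= t' and 'pixels > t'."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * c for i, c in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def flood_fill(mask, seed, value):
    """Fill the 4-connected region containing seed with value, in place."""
    h, w = len(mask), len(mask[0])
    sy, sx = seed
    old = mask[sy][sx]
    if old == value:
        return mask
    mask[sy][sx] = value
    queue = deque([(sy, sx)])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] == old:
                mask[ny][nx] = value
                queue.append((ny, nx))
    return mask
```

In such a sketch, the binary mask would be obtained by comparing the filtered grayscale image against the Otsu threshold, and the flood fill would then overwrite any leftover region reachable from a chosen seed pixel.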
After the noise removal and clean-up steps described above, the image features extracted from each blood sample are used to train a TensorFlow convolutional neural network classifier to predict the presence of malaria parasite species and identify the life cycle development stage. The convolutional neural network is adopted in this project because of its ability to sort images into categories, in some cases even better than humans. The benchmark metrics for this research are specificity, sensitivity, and accuracy. The goal is to improve on previous research that used support vector machines and reached 78%; our target is at least 98% on these metrics.
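For reference, the three benchmark metrics can be computed from binary labels as follows. This is a minimal sketch; the function name and the convention that label 1 means "parasite present" are assumptions made for the example.

```python
def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        # share of truly infected samples that were flagged
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        # share of truly clean samples that were cleared
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        # share of all samples classified correctly
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }
```

Reporting all three matters for diagnosis: with few positive samples, a classifier can reach high accuracy while still missing most infections, which only sensitivity reveals.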
The final result of our research work will be deployed on smartphones for wider coverage. This will help reduce the reliance of malaria diagnosis worldwide on expensive laboratory equipment and on the performance of a human operator.
24 | Unsupervised Density and Semantic Analysis of Co-expression Network for Identifying Gene Biomarkers of Alzheimer's Disease | Tulika Kakati, Tezpur University; Hirak J. Kashyap, University of California Irvine; Dhruba K. Bhattacharyya, Tezpur University
Keywords: Clustering, Co-expression network, Gene biomarker, Alzheimer’s Disease
Analysis of gene expression data using computational methods provides better insight into the progression of neurodegenerative diseases, such as Alzheimer’s Disease (AD), and helps further investigation of these diseases as well as effective drug development. A gene co-expression network (CEN) is a type of graph that encodes bio-entities, such as genes and proteins, and aids in the analysis of functionally related groups of these bio-entities, termed network modules. Existing computational methods for extraction of CEN modules discard genes with low expression similarity, irrespective of their biological relevance or semantic similarity.

In this work, we present a computational method named THD-Module Extractor, which extracts genes with low expression similarity but high semantic similarity that are located at the border of a network module. For an AD dataset [GSE4226], the border genes found by our method are assessed to be related to the progression of AD using Gene Ontology enrichment analysis, pathway analysis, and KEGG enrichment analysis.

The proposed method operates on a microarray gene expression dataset and performs density estimation in the CEN using SSSim (Ahmed et al., 2014) in order to extract network modules. The genes located at the borders of these network modules are further analyzed using Lin’s semantic similarity measure (Lin, 1998), and genes with high semantic similarity to the module center are extracted. The parameters used in density estimation are dynamically updated in each iteration of the module extraction process.
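Lin's measure scores two ontology terms by how much information their most informative common ancestor carries relative to the terms themselves. The sketch below shows the formula only; the function names and the idea of passing annotation probabilities directly are assumptions for illustration, and how the method selects the module center or the term probabilities is not specified here.

```python
import math


def information_content(p):
    """Information content of a term with annotation probability p (0 < p <= 1)."""
    return -math.log(p)


def lin_similarity(p_a, p_b, p_common):
    """Lin (1998) similarity: 2 * IC(common ancestor) / (IC(a) + IC(b)).

    p_common is the probability of the most informative common ancestor
    of the two terms in the ontology.
    """
    denom = information_content(p_a) + information_content(p_b)
    if denom == 0:
        return 0.0  # both terms are the root; treated as uninformative here
    return 2 * information_content(p_common) / denom
```

The score is 1 when both terms coincide with their common ancestor and falls towards 0 as the only shared ancestor approaches the ontology root.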

Our method is validated using datasets pertaining to various species, in terms of both statistical and biological significance measures, such as p value, q value, and KEGG enrichment scores (Glaab et al., 2010). Correlation analysis of each pair of genes in a large dataset is computationally expensive. This work leverages the parallel computing capabilities of Graphics Processing Units (GPUs) to compute the SSSim correlation matrix, implemented using the NVIDIA CUDA library.

1. Ahmed, Hasin Afzal, et al. "Shifting-and-scaling correlation based biclustering algorithm." IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 11.6 (2014): 1239-1252.
2. Lin, Dekang. "An information-theoretic definition of similarity." ICML. Vol. 98. 1998.
3. Glaab, Enrico, et al. "TopoGSA: network topological gene set analysis." Bioinformatics 26.9 (2010): 1271-1272.
30 | Online Learning with Random Feedback Graphs | Monireh Dabaghchian, George Mason University; Amir Alipour-Fanid, George Mason University; Kai Zeng, George Mason University; Qingsi Wang, Google; Peter Auer, University of Leoben
In a typical multi-armed bandit (MAB) problem, the agent observes the reward on the arm he pulls. However, there are
applications in online learning in which the agent takes an action without being able to observe the reward on this
action. In order to learn the best action though, the agent has the capability to observe potential rewards on one or
several other actions at each time step. In the worst case scenario, the agent can either act or observe per time step.
An application of this learning model is in the security of cognitive radio networks. More specifically, the application
is to find the optimal online learning-based attacking strategies for an attacker who wants to attack the cognitive
radios in this network. The analogy of this problem with MAB is that the attacker is the agent and the wireless
channels he attacks are the arms/actions. The attacker has a unique feature of not being able to observe the reward
on the attacked channel (whether attack was successful or not). Based on this feature, we define attackers with two
types of observation capabilities. In the first one, the attacker’s capabilities are limited such that if he attacks, he will
not have the chance to make any observation on another channel at that time step. Based on the second type, at
each time step, the attacker after attacking a channel chooses at least one other channel to observe.
We propose two novel online learning algorithms to solve each of these problems [1]. Our first algorithm addresses
online learning for an attacker with no observation capability in any time step in which he attacks. For this
problem, in order to attack and learn the most rewarding channel at the same time, the attacker has no choice but
to dedicate some time steps to observation in order to identify the most rewarding channels. For this purpose, we propose a
novel online learning algorithm that dynamically decides between attack and observation. Through our analysis, we
rigorously measure the performance of the algorithm in terms of regret upper and lower bounds. Based on our
analysis, the regret upper bound matches the regret lower bound, which proves the optimality of this algorithm. In this
case, the regret bound is in the order of Θ(T^(2/3)). After developing our algorithm and its analysis, we realized
that, if modeled properly, this problem can also be solved with feedback graphs [2], which have recently been proposed
to address learning with side observations, leading to the same regret order. However, our algorithm is easier to
understand and implement and leads to a smaller constant factor in the regret upper-bound.
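A heuristic way to see where a T^(2/3) order can come from, under assumptions that are not stated in the abstract (stochastic rewards in [0,1], K channels, and a simple explore-then-commit scheme rather than the authors' adaptive algorithm): if the attacker spends m steps purely observing and then attacks the empirically best channel, the regret trades exploration cost against estimation error.

```latex
% Illustrative assumptions only: K channels, i.i.d. rewards in [0,1],
% m observation-only steps spread evenly, then attack for the rest of T.
\begin{align*}
R(T) &\lesssim \underbrace{m}_{\text{steps spent observing}}
      + \underbrace{T\sqrt{\frac{K \log T}{m}}}_{\text{attacking a near-best channel}}, \\
m^\star &\propto T^{2/3} (K \log T)^{1/3}
  \quad\Longrightarrow\quad
R(T) = O\!\left(T^{2/3} (K \log T)^{1/3}\right).
\end{align*}
```

The first term grows with m and the second shrinks with it, so balancing them gives the T^(2/3) rate; this is only an intuition for the order, not the analysis behind the bounds claimed above.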
Our second algorithm is a novel online learning algorithm that provides an optimal attacking strategy for an attacker
with the capability of observing the reward on at least one other channel after the attack, during the same time step.
Similar to the first problem, this problem can also be modeled by Feedback graphs which still leads to a regret in the
order of O(T^(2/3)). This is in contrast to our solution which leads to a significant improvement with regret in the
order of O(T^(1/2)). In our solution, we propose a new framework that introduces the concept of random feedback
graphs to online learning algorithms. We derive the regret lower bound as well and show that it matches the regret upper bound,
leading to a regret in the order of Θ(T^(1/2)).
Our theoretical study reveals important insights on the learning with observation capability: with no observation at
all while acting, the agent loses on the regret order, and with the observation of at least one other action while
acting, there is a significant improvement in the agent's performance. Moreover, observing multiple actions does
not give additional benefit to the attacker (only a constant scaling). Finally, if an agent can only make side
observations, making random observations leads to better performance in terms of regret order than making
deterministic observations.
1. M. Dabaghchian, A. Alipour-Fanid, K. Zeng, Q. Wang, P. Auer, “Optimal Online Learning with Randomized Feedback
Graphs with Application in PUE Attacks in CRN”, arXiv preprint, September 2017
2. N. Alon, N. Cesa-Bianchi, O. Dekel, T. Koren, “Online Learning with Feedback Graphs: Beyond Bandits”, arXiv preprint,
February 2015.
31 | Stochastic Gradient Descent: Going As Fast As Possible But Not Faster | Alice Schoenauer Sebag, UCSF; Marc Schoenauer, INRIA-CNRS-UPSud-UPSay; Michele Sebag, INRIA-CNRS-UPSud-UPSay
When applied to training deep neural networks, stochastic gradient descent (SGD) often incurs steady progression phases, interrupted by catastrophic episodes in which loss and gradient norm explode. A possible mitigation of such events is to slow down the learning process.
This paper presents a novel approach to controlling the SGD learning rate that uses two statistical tests. The first one, aimed at fast learning, compares the momentum of the normalized gradient vectors to that of random unit vectors and accordingly gracefully increases or decreases the learning rate. The second one is a change point detection test, aimed at the detection of catastrophic learning episodes; upon its triggering, the learning rate is instantly halved.
The ability to both increase and decrease the learning rate allows the proposed approach, called SALeRA, to learn as fast as possible but not faster. Experiments on standard benchmarks show that SALeRA performs well in practice and compares favorably to the state of the art.
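The second test, halving the learning rate when a change point is detected in the loss, can be sketched as follows. The abstract does not name the detector, so the use of a Page-Hinkley test, the parameter values `delta` and `threshold`, and the function names are all assumptions for illustration; the first (momentum-based) test is not covered here.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream of values."""

    def __init__(self, delta=0.05, threshold=5.0):
        self.delta = delta          # tolerated drift per observation
        self.threshold = threshold  # alarm level for the cumulative deviation
        self.reset()

    def reset(self):
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0      # cumulative deviation from the running mean
        self.cum_min = 0.0  # smallest cumulative deviation seen so far

    def update(self, x):
        """Feed one observation; return True if a change point is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


def control_learning_rate(losses, lr, detector):
    """Halve lr whenever the detector fires on the loss stream.

    Returns the learning rate in effect after each observed loss.
    """
    history = []
    for loss in losses:
        if detector.update(loss):
            lr /= 2.0
            detector.reset()  # start monitoring afresh after the intervention
        history.append(lr)
    return history
```

For example, feeding twenty losses of 1.0 followed by a jump to 10.0 leaves the rate untouched during the stable phase and halves it once at the jump.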
59 | FastNorm: Improving Numerical Stability of Deep Network Training with Efficient Normalization | Sadhika Malladi, Massachusetts Institute of Technology (work done at Cerebras Systems); Ilya Sharapov, Cerebras Systems
Normalization in deep network training is an effective tool to combat internal covariate shift, the change in the distributions of activations in hidden layers that can slow network convergence. While existing normalization methods improve convergence properties, these techniques require a significant number of additional operations for computing the norms and applying them to activations or weights.

We propose a modification to weight-normalizing techniques that requires fewer computational operations. In finite precision, it offers improved numerical stability and reduced accuracy variance. Our proposed method, FastNorm, exploits the structure of weight updates and infers the norms without explicitly calculating them, replacing a quadratic-order computation with an O(n) one for a fully-connected layer. It also avoids replacing the weights with their normalized versions and instead applies the norms of the weights more efficiently to the output of the forward pass and to the delta computation. On a convolutional layer, FastNorm computes the norms conventionally but still applies them efficiently to the computations to avoid updating the weight matrix. Since the weights are not forced to deviate from the gradient descent path by this normalization, we see an improvement in the stability of the network as it trains. As a result, networks with FastNorm can tolerate a higher learning rate, which allows for faster convergence.
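One way such norm inference could work for a fully-connected layer is sketched below. This is an assumption about the mechanism, not the published FastNorm algorithm: it supposes a single-sample SGD step, where row i of the weight matrix changes by c_i * x, so that ||w_i + c_i x||^2 = ||w_i||^2 + 2 c_i (w_i . x) + c_i^2 ||x||^2, and w_i . x is already available from the forward pass. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def forward(W, x):
    """Unnormalised pre-activations z[i] = W[i] . x."""
    return [dot(row, x) for row in W]


def normalized_output(z, sq_norms):
    """Apply the row norms to the outputs instead of rescaling the weights."""
    return [zi / math.sqrt(s) for zi, s in zip(z, sq_norms)]


def sgd_step_tracking_norms(W, sq_norms, x, delta, lr):
    """Apply W[i] += c_i * x with c_i = -lr * delta[i], in place, and update
    the stored squared norms in O(1) per row using the forward-pass outputs."""
    z = forward(W, x)  # in training this is reused from the forward pass
    xx = dot(x, x)
    for i, row in enumerate(W):
        c = -lr * delta[i]
        sq_norms[i] += 2 * c * z[i] + c * c * xx
        for j in range(len(row)):
            row[j] += c * x[j]
    return z
```

Updating all n row norms this way costs O(n) given the forward-pass outputs, whereas recomputing every row norm from the weights costs O(n^2) for an n-by-n layer.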
60 | Recommender System for Boosting Business Revenue based on Actionability and Sentiment Analysis | Katarzyna Tarnowska, University of North Carolina at Charlotte
The research presents a data-driven, user-friendly Recommender System for Improving Customer Loyalty. It proposes a novel approach for building a knowledge-based recommender system in the business application area. Data Mining techniques were used to extract actionable knowledge, in the form of action rules, from customer feedback data on B2B heavy equipment repair services. The goal of the action rules is to convert Detractor customers into Promoter customers, and therefore improve the company's Net Promoter Score and growth performance. Clustering techniques were used to group semantically similar companies. Additionally, Text Mining techniques were used to preprocess complementary text comments left by customers. The overall sentiment towards service aspects is extracted and presented as the final recommendations using a variety of Data Visualization techniques. The optimal recommendations are calculated and presented to an end business user who can explore the system's recommendations in an interactive manner.
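For context, the Net Promoter Score is conventionally computed from 0-10 survey ratings as the percentage of Promoters (9-10) minus the percentage of Detractors (0-6). The sketch below uses that standard definition; whether the feedback data in this work uses exactly these cut-offs is an assumption.

```python
def net_promoter_score(ratings):
    """NPS on the conventional 0-10 scale: % promoters minus % detractors."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)
```

Converting a single Detractor into a Promoter therefore moves the score by two customers' worth of percentage points, which is why action rules targeting that conversion affect the metric directly.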
63 | Learning the Multilinear Structure of Visual Data | Mengjiao Wang, Imperial College London; Yannis Panagakis, Imperial College London; Patrick Snape, Imperial College London; Stefanos Zafeiriou, Imperial College London
Statistical decomposition methods are of paramount importance in discovering the modes of variation of visual data. Probably the most prominent linear decomposition method is Principal Component Analysis (PCA), which discovers a single mode of variation in the data. However, in practice, visual data exhibit several modes of variation. For instance, the appearance of faces varies in identity, expression, pose, etc. To extract these modes of variation from visual data, several supervised methods, such as TensorFaces, that rely on multilinear (tensor) decomposition (e.g., Higher Order SVD) have been developed. The main drawback of such methods is that they require both labels regarding the modes of variation and the same number of samples under all modes of variation (e.g., the same face under different expressions, poses, etc.). Therefore, their applicability is limited to well-organised data, usually captured in well-controlled conditions. In this paper, we propose the first general multilinear method, to the best of our knowledge, that discovers the multilinear structure of visual data in an unsupervised setting, that is, without labels. We demonstrate the applicability of the proposed method in two applications, namely Shape from Shading (SfS) and expression transfer.
67 | A Retinal Vasculature Tracking System Using A Deep Network and Particle Filtering | Fatmatulzehra Uslu, Imperial College London; Anil Anthony Bharath, Imperial College London
Changes in retinal vessel width in human eyes have been associated with the incidence of eye-related and systemic diseases, such as diabetic retinopathy and hypertension, respectively. Vessel width has been estimated from fundus images by using the intensity distribution across vessels, modelling the spatial distribution of intensity with, for example, parametric functions such as Gaussians or Hermite polynomials. However, this approach may not be sufficient to estimate widths in situations where vessel boundaries are lost or the intensity distribution does not resemble the assumed function, such as at junctions, or in images with low contrast, noise, the central light reflex or pathologies.
In this study, we estimate vessel widths using the posterior probability distributions of vessel boundaries and centrelines with particle filtering. The probability distributions are obtained from a fully connected deep network, which transforms fundus image patches to vessel interior, centreline and boundary probability maps. The probabilities sampled across vessels from all probability maps are utilised to calculate the likelihoods of hypothesised vessel widths. Figure 1(a)-(b) shows vessel interior and boundary probability profiles with red and blue dashes respectively. Arrows represent sampling locations from a profile, whose colour matches that of the arrows (green arrows are sampled from the complement of the boundary probability profile). Hypothesised edge locations are given as E1 and E2. Figure 1(a) demonstrates a strong hypothesis, where E1 and E2 match better with the locations of the maxima of the boundary probability profile, whereas Figure 1(b) illustrates a weak hypothesis. The likelihood calculated for Figure 1(a) is found to be 1555 times that for Figure 1(b).
In contrast to previous studies, the method uses the uncertainty in the detection of vessel interior, centreline and boundary pixels in fundus images to improve width estimation. Because the method does not rely on explicit modelling of the vessel cross-section intensity distribution, the challenging situations that degrade the performance of previous methods do not affect the performance of the proposed method as much. Vessels that are only faintly visible to the human eye, or with suppressed boundary information, appear to be well captured by the suggested method. Also, the likelihood calculations for particle filtering are simplified because the outputs of the trained network provide probabilities that can be used without post-processing, so that hundreds of width hypotheses can be quickly evaluated.
The performance of the method is evaluated on the REVIEW dataset, which is a benchmark for width estimation [Al-Diri, et al. 2008]. Although this dataset does not have binary vessel masks, the network trained on the DRIVE dataset [Staal et al. 2004], a benchmark for vessel segmentation, was found to be sufficient to generate related probability maps. Experiments on the REVIEW data showed that the method maintained self-consistent widths even for challenging vessels, such as those containing the central light reflex or pathologies; in addition, meaningful results were obtained for all profiles. Finally, the method was observed to surpass human observers at locating vessel boundaries in some challenging cases. Figure 2 shows the predicted (green lines) and reference (red lines) vessel cross-sections overlaid on the fundus image. The method is also found to produce consistent width estimates on junction regions and for low contrast vessels.
71 | Evaluating Gaussian Process Metamodels for Noisy Level Set Estimation | Xiong Lyu, University of California Santa Barbara; Mike Ludkovski, University of California Santa Barbara
We consider the problem of learning the set of points for which a noisy black-box function exceeds some particular threshold. Such problems arise during experimental design for simulation-based algorithms for valuation of Bermudan options, and reliability analysis of engineering systems.
Metamodeling techniques are widely used for the purpose of efficiently reconstructing the latent function and the level set based on a small number of well-chosen simulations. In particular, Gaussian Process surrogates have become the leading non-parametric approach in analysis and sequential design of deterministic samplers.
However, there is limited work evaluating the performance of GP metamodels with stochastic samplers. In particular, we are motivated by contexts where the noise is heteroskedastic and non-Gaussian (heavy-tailed). These features have a strong impact on algorithm performance which inspires us to investigate alternative GP-based metamodels for learning the level sets.
Specifically, beyond standard GPs, we compare to two choices that make the metamodel more robust to noise mis-specification and observed outliers: (i) t-observation GPs; (ii) classification GPs that directly model the exceedance of the response vis-à-vis the threshold via a spatial logistic model driven by a GP. Our proposal of probit logistic regression sharpens the focus on the level-set objective while maintaining analytic tractability. As a third extension, we study the performance of GP surrogates with monotonicity constraints, in order to capture a common structure in applications whereby the level set is known to be connected.
To implement the above comparators, we extend existing Stepwise Uncertainty Reduction acquisition functions to the heteroskedastic stochastic setting. We also develop (approximate) updating formulas to efficiently compute such acquisition functions in a sequential design framework.
We present several simulation studies using synthetic settings (where the ground truth is known) and realistic settings in 1-6 dimensions. The simulation studies are designed around different contexts: different noise distributions, different signal-to-noise ratios, and different dimensions. We compare the alternative GP-based metamodels with the benchmark GP and find that they offer substantial improvements over the standard Gaussian GP approach: (i) the t-observation GP model performs better than the Gaussian observation GP model, in terms of smaller empirical error (a main metric designed to compare the performance of metamodels in the stochastic setting) and smaller error rate, due to its robustness to outliers; (ii) the classification model performs better when the noise is heavy-tailed or the signal-to-noise ratio is small, since it removes the outliers by considering the sign instead of the value; (iii) for a monotone latent function, the monotonic GP models have smaller empirical error than GP models without the monotonicity constraint, because the additional gradient constraint intrinsically lowers posterior uncertainty; (iv) the classification model and t-observation GPs appear to enjoy the fastest reduction in empirical error as more inputs are sampled. As for the estimation of the boundary between level sets, we observe that the t-observation GPs yield a tighter CI than the other models. As a further application, we consider contour-finding for determining the optimal exercise policy and timing value of a Bermudan Put option.
72 | Prediction and Resource Allocation for Hospital No-shows using Electronic Health Records Data | Qiaohui Lin, Brenda Betancourt, Benjamin A. Goldstein, Rebecca C. Steorts
Hospital no-shows occur when a patient does not show up for a scheduled appointment or cancels on the same day (appointments are normally scheduled at least 3 days in advance). The implications of late cancellations and missed appointments range from lost revenue due to wasted resources at the time of the appointment to future additional costs associated with the increased risk of emergency visits or hospital admissions. For these reasons, strategies to reduce the proportion of no-shows are currently a priority for healthcare providers and administrators. Such strategies include confirmation calls a few days prior to the appointment and overbooking.
In this work, we build a Bayesian hierarchical logistic model to predict no-shows and minimize the associated loss, using electronic health records (EHR) collected from a large academic medical center. The data consist of over 2 million patients between the ages of 18 and 89 who had an appointment at an outpatient specialty clinic from 2014 to 2016. The data include a total of 14 specialties (e.g., cardiology, neurology, dermatology) and 61 clinics. In this particular EHR dataset, the average no-show rate across specialties is approximately 18%. The EHR data also contain representative patient demographics (age, gender, employment status, etc.), healthcare plan information, appointment type, and patient response to appointment confirmation, among others, for a total of 67 covariates.
In order to guide variable selection, we induce shrinkage by imposing a Lasso penalty, which is equivalent to assigning a double exponential prior to the regression coefficients (Park and Casella, 2008). The model also includes a random effect to account for patient variation within each clinic (George et al., 2017). We utilize the data augmentation approach based on Polya-Gamma latent variables proposed by Polson et al. (2013), which allows for a fully conjugate representation of the model and straightforward posterior inference through Gibbs sampling. As a result, we observe good predictive performance of the model on the test sets for most of the clinics, with AUC values ranging between 70% and 91%. In addition, using Bayesian decision theory, we help clinics decide the optimal overbooking rate and the optimal proportion of patients they should target with additional confirmation phone calls in order to minimize the resource loss caused by no-shows. We design an asymmetric loss function for overbooking that penalizes high overbooking rates, where patients come in but cannot be treated, and a symmetric loss function for the phone call proportion, where we expect the same resource loss from extra calls and extra no-shows. The optimized overbooking rate among clinics has first and third quartiles of 14% and 25%; the optimized phone call proportion has first and third quartiles of 13% and 21%.
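The decision-theoretic step can be illustrated with a small sketch: given posterior predictive draws of the number of no-shows in a clinic session, pick the number of extra bookings that minimises the expected asymmetric loss. The function names, the linear form of the loss, and the cost values are assumptions for illustration; the abstract does not give the authors' loss function.

```python
def expected_loss(extra, no_show_samples, over_cost=3.0, under_cost=1.0):
    """Average asymmetric loss of booking `extra` additional patients.

    Booking more patients than there are no-shows (patients arrive but cannot
    be treated) costs over_cost per patient; leaving a no-show slot unfilled
    costs under_cost per slot.
    """
    total = 0.0
    for n in no_show_samples:
        if extra > n:
            total += over_cost * (extra - n)
        else:
            total += under_cost * (n - extra)
    return total / len(no_show_samples)


def best_overbooking(no_show_samples, max_extra, over_cost=3.0, under_cost=1.0):
    """Number of extra bookings in 0..max_extra with the smallest expected loss."""
    return min(
        range(max_extra + 1),
        key=lambda e: expected_loss(e, no_show_samples, over_cost, under_cost),
    )
```

Because overbooking too much is penalised more heavily than overbooking too little, the chosen value sits below the median of the predicted no-shows; with a symmetric loss it would move to the median.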

1. Qiaohui Lin is a graduate student at Duke University, qiaohui.lin@duke.edu; Brenda Betancourt is a postdoctoral fellow in the Department of Statistical Science at Duke University, bbetanc1@soe.ucsc.edu; Benjamin A. Goldstein is Assistant Professor of Biostatistics and Bioinformatics at Duke University, joint with the Duke Clinical Research Institute, 2424 Erwin Road, Suite 11041, Durham, NC 27705, benjamin.a.goldstein@duke.edu; Rebecca C. Steorts is an Assistant Professor in the Department of Statistical Science at Duke University, 216 Old Chemistry, Durham, NC 27708, beka@stat.duke.edu.
2. Park, T., Casella, G. (2008). "The Bayesian Lasso". Journal of the American Statistical Association, 103(482), 681-686.
3. George, E.I., Rockova, V., Rosenbaum, P.R., Satopaa, V.A. and Silber, J.H. (2017). "Mortality Rate Estimation and Standardization for Public Reporting: Medicare's Hospital Compare." Journal of the American Statistical Association, just-accepted (2017).
4. Polson, N.G., Scott, J.G., and Windle, J. (2013). "Bayesian inference for logistic models using Polya-Gamma latent variables." Journal of the American Statistical Association, 108(504), 1339-1349.
75 | Sparse Modular Graphs with Reciprocating Relationships | Xenia Miscouridou, University of Oxford; Francois Caron, University of Oxford; Yee Whye Teh, University of Oxford and Google DeepMind
We propose a class of models that uncovers reciprocity in temporal data of sparse networks with overlapping community structure. This family of models is based on Hawkes Processes, families of self or mutually exciting processes that can capture reciprocity in the dynamic appearance of edges.
The intensity of the Hawkes Processes is modelled using a compound completely random measure and can handle sparse graphs with power-law behaviour and node memberships to latent communities. We show that the proposed model uncovers both the network structure and the reciprocity in the appearance of temporal edges in real world networks and outperforms alternative models in link prediction.
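To make the reciprocity mechanism concrete, the sketch below evaluates the intensity of a mutually exciting Hawkes process for directed interactions from node i to node j, excited by earlier interactions from j to i. The exponential kernel, the parameter names `jump` and `decay`, and the constant base rate are assumptions for illustration; in the proposed model the base intensity comes from a compound completely random measure, which is not reproduced here.

```python
import math


def hawkes_intensity(t, base_rate, reverse_event_times, jump=0.5, decay=1.0):
    """Intensity at time t of i -> j events, excited by past j -> i events.

    Each past reciprocal event at time s < t adds jump * exp(-decay * (t - s)).
    """
    excitation = sum(
        jump * math.exp(-decay * (t - s))
        for s in reverse_event_times
        if s < t
    )
    return base_rate + excitation
```

Right after node j contacts node i, the rate at which i replies is raised by `jump` and then relaxes back to the base rate, which is how such a model captures reciprocated edges.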
96 | A Generic Neural Architecture for Multiple Inputs and Outputs | Trang Pham, Deakin University; Truyen Tran, Deakin University; Svetha Venkatesh, Deakin University
Recent machine learning research has been directed towards leveraging shared statistics among labels, instances and data views. These have given rise to fruitful research directions for multi-input data such as multi-view and multi-instance learning, for multi-output data such as multi-label and multi-task learning, and for a combination of multi-input and multi-output data such as multi-view+multi-label and multi-instance+multi-label learning. We refer to these settings collectively as multi-X learning.

In this paper, we ask a bold question: Is it possible to build a generic neural architecture that simultaneously addresses many questions in multi-X learning? To effectively model multi-X problems, the key property of an architecture must be to capture the shared statistics/representation of all input parts or labels while retaining specific features for each input or label. We propose a deep architecture called Column Bundle, which is a set of columns connected in a specific way for shared statistics. The notion of a column is inspired by the columnar structure in the neocortex and is implemented by a deep network in our model. Each input part is represented as a mini-column, which is recurrently connected to the central column. The central column plays the role of an “executive function” seen in models of working memory. In the inference phase, the central column then sends output signals to separate outputs (e.g., label, task) that are also represented as mini-columns connected to the central column recurrently.

The Column Bundle architecture is suited for multi-X problems as it can capture the correlations among multiple parts of data. The central column embeds the shared representation of all the inputs or labels. Each mini-column reads the shared representation from the central column together with its own specific input signals; hence, both shared and specific representations are embedded in the mini-columns. The mini-columns interact through the central column, so the correlations among inputs or among labels are established through the bundle.

Our model is flexible in that it can work on all multi-X problems without changing the structure. For each type of data, we only need to set up the inputs and the outputs for the Column Bundle. This removes the need to implement multiple solutions. Experiments were conducted on 9 datasets covering 5 different multi-X problems. Our generic model beats state-of-the-art baselines designed specifically for each problem.
103 | Using Machine Learning to Detect Potential Child Suicide Bombers in Northern Nigeria | Francisca Oladipo, Federal University Lokoja Nigeria
The aim of this research is to stop the use of young women, girls and children as suicide bombers by the Boko Haram terrorists through the development of a supervised machine learning algorithm that identifies signs of radicalization in them, classifies the probability of attacks, and enables security agencies to intervene and stop them.
Terrorism in Nigeria officially started in 2009, when members of the Boko Haram sect rebelled against the constituted authority of the Nigerian state [1]. The activities of the insurgents in Northern Nigeria have since resulted in over 20,000 deaths, thousands of abductions by the terrorists, and up to 3.5 million internal refugees who are scattered in internally displaced persons (IDP) camps across the North of Nigeria, Niger Republic, Cameroun and Chad. 56% of these IDPs are women, young girls and children [2]. Following the technical defeat of the Boko Haram terrorists [3] and the subsequent capture of their operational base in the Sambisa forests [4], many of them escaped into the population and resorted to attacking soft targets like schools, markets, recreational centres and IDP camps, using young children, girls and women as suicide bombers, the youngest so far being 7 years old. Several hundred women and young girls have been abducted, indoctrinated and radicalized by the insurgents, the most notable being the over 200 Chibok school girls who were kidnapped in 2014 and who are being released in batches by the insurgents. Many of the abductees are also being rescued by security agencies and placed in IDP camps, but findings revealed that most of the suicide bombings involving young children are being carried out by radicalized ones known to people within the areas where they lived. Within 2017 alone, at least 83 have been used as suicide bombers [5]. As a university teacher and STEM enthusiast working with out-of-school girls in North Central Nigeria, the author found that about 72% of these are females already living within the community and interacting with people. This is the motivation for this research.
The objective of this work is to deploy a classification algorithm to parse and filter data on potential radicalization based on a set of sematic dimensions and sentiment analysis. This work leverages on previous similar researches through an extensive literature review [6], [7], [8], [9], etc, we discovered that previous similar approaches based on neural network, naïve bayes and support vector machine are limited in their abilities to individually predict the intentions of a would-be terrorist. We conducted several surveys to determine the factors responsible for the preference of women, girls and young children as suicide bombers; and the various signs and sentiments observed by people who had contacts with some of the suicide bombers and based on the result of the survey, we develop a classification algorithm incorporating behavioral analysis to parse specific parameters related to: identified signs of radicalization, hate messages, social media postings, etc. The ongoing phase in our research is the development an application based on the defined machine learning algorithm. Using our dataset, the application data set will identify signs of radicalization, predict likelihood of being used as suicide bomber, conduct facial recognition to show if the subject had recently been in a terrorist camp, or if the subject is a bona-fide resident of the IDP of deployment, determine the ageing factor of SMS, classify social media contents as radical (where available) and interface with law enforcements for quick intervention.
1. Nossiter, A. (2009). “Scores Die as Fighters Battle Nigerian Police”. The New York Times. Retrieved August 2017.
2. “Boko Haram's Bloodiest Year Yet: Over 9,000 Killed, 1.5 Million Displaced, 800 Schools Destroyed in 2014”. Christian Post. Retrieved March 2015.
3. “Nigeria Boko Haram: Militants 'technically defeated' - Buhari”. BBC News. 24 December 2015.
4. “Boko Haram 'crushed' by Nigerian army in final forest stronghold”. The Independent. 24 December 2016.
5. “UNICEF: Boko Haram use of child bombers soars”. Aljazeera. 22 August 2017
6. Saha, Snehanshu & Aladi, Harsha & Kurien, Abu & Basu, Aparna. (2017). Future Terrorist Attack Prediction using Machine Learning Techniques. 10.13140/RG.2.2.17157.96488.
7. Enghin Omer (). “Using machine learning to identify jihadist messages on Twitter”. Unpublished Thesis, Department of Information Technology, Uppsala University.
8. Kris Shaffer (2017). “Detecting terrorism with AI”. Retrieved August 2017 from http://pushpullfork.com/2017/02/detecting-terrorism-with-ai/
9. Jha, Manoj K. (2009). Dynamic Bayesian Network for Predicting the Likelihood of a Terrorist Attack at Critical Transportation Infrastructure Facilities. Journal of Infrastructure Systems, 15(1), 31.
111 | The detour problem in a stochastic environment. | Pegah Fakhari, Indiana University; Arash Khodadadi, Indiana University; Jerome R. Busemeyer, Indiana University | We designed a grid world task to study human planning and re-planning behavior in an unknown stochastic environment. Participants were asked to navigate from a random starting point to a random goal position while maximizing their reward. Because they were not familiar with the environment, they needed to learn its characteristics from experience to plan optimally. Later in the task, we randomly blocked the optimal path to investigate whether and how people adjust their original plans to find a detour. To this end, we compared reinforcement learning (RL) models that differed in how they learned and represented the environment and in how they planned to reach the goal.

There are two main classes of RL models that can provide a solution to our sequential decision problem. The first class, the model-based approach, finds the optimal policy by learning a model of the environment. The second class, the model-free approach, is able to maximize the expected sum of future rewards without knowing the characteristics of the environment (Sutton & Barto, 1998). The second approach is therefore computationally efficient, but it only incorporates one-step rewards (plus the expected future reward from the next step) from a particular state into planning, which limits its ability to plan when the start and goal positions change. A third RL algorithm, the successor representation (SR), learns a rough representation of the environment by storing the expected future visits of each state (Dayan, 1993). It is computationally less expensive than model-based RL and can easily explain the planning behavior, but not the re-planning data. One possible solution to this problem is to combine the SR model with the model-based RL model (hybrid SR-MB) (Russek et al., 2017).
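
The idea behind the successor representation can be sketched in a few lines. This is a minimal illustration rather than the authors' model: the three-state corridor, the discount factor of 0.9 and the function names are assumptions made for the example. For a fixed policy with transition matrix T, the SR matrix M satisfies M = I + gamma * T * M, and state values are obtained by multiplying M with the reward vector.

```python
def successor_representation(T, gamma, sweeps=200):
    """Iterate M <- I + gamma * T * M for a fixed policy (assumed setup)."""
    n = len(T)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(sweeps):
        M = [[identity[i][j] + gamma * sum(T[i][k] * M[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
    return M


def state_values(M, reward):
    """Value of each state = discounted expected visits times reward."""
    return [sum(m * r for m, r in zip(row, reward)) for row in M]


# Hypothetical corridor 0 -> 1 -> 2; state 2 has no outgoing transitions.
T = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
M = successor_representation(T, gamma=0.9)

values = state_values(M, [0.0, 0.0, 1.0])             # goal at state 2
moved_goal_values = state_values(M, [0.0, 1.0, 0.0])  # goal moved to state 1
```

Because M is stored separately from the rewards, moving the goal only needs a new reward vector. A blocked path, however, changes T and leaves the cached M stale, which is in line with the observation that SR alone does not account for re-planning.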

We found that the majority of our participants were able to plan optimally. We also showed that people were capable of revising their plans when an unexpected event occurred. The model comparison showed that the model-based reinforcement learning approach provided the best account of the data and outperformed heuristics in explaining the behavioral data in the re-planning trials.
115 | A Weakly Supervised Deep Model for Cyberbullying Detection | Elaheh Raisi, Bert Huang | The advent of social media has revolutionized human communication, significantly improving individuals’ lives. However, we must consider some of its negative implications. Cyberbullying is one of the major adverse consequences of social media. We address the computational challenges associated with cyberbullying detection by developing a machine learning framework with three distinguishing characteristics: (1) it uses minimal supervision to learn the complex patterns of cyberbullying, (2) it consists of an ensemble of two learners that co-train one another, and (3) it incorporates the efficacy of distributed word and graph-node representations by training nonlinear deep models.
Many machine learning approaches to the cyberbullying problem consider supervised and text-based cyberbullying detection, classifying social media posts as ‘bullying’ or ‘non-bullying’. In these approaches, crowdsourced workers annotate the data and supervised machine learning algorithms train classifiers from this annotated data [1,2,3]. There are, however, several challenges related to these approaches. Fully annotating data requires human intervention, which is costly and time consuming. And without considering social context, differentiating bullying from less harmful behavior is difficult due to complexities underlying cyberbullying and related behavior.
In our proposed framework, the algorithm learns from weak supervision in the form of expert-provided key phrases that are highly indicative of bullying. The framework trains an ensemble of two learners in which each learner looks at the problem from a different perspective. One learner identifies bullying incidents by examining the language content in the message; another learner considers the social structure to discover bullying. Individual learners train each other to come to an agreement about the bullying concept.
We represent words and users as low-dimensional vectors of real numbers. We use word and user vectors as the input to nonlinear language-based and user-based classifiers, respectively. We examine two strategies when incorporating vector representations of words and users. First, we use existing doc2vec, which is an extension of word embedding, and node2vec models as inputs to the learners. Second, we create new embedding models specifically geared for analysis of cyberbullying, in which word and user vectors are trained during optimization of the model.
The model is trained by optimizing an objective function made up of two loss functions: (1) a co-training loss, which penalizes the disagreement between the nonlinear deep language-based model and the nonlinear deep user-based model; and (2) a weak-supervision loss that is the classification loss on weakly labeled messages.
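A two-term objective of this kind can be sketched as follows. The abstract does not give the exact loss forms, so everything specific here is an assumption for illustration: squared disagreement for the co-training term, binary cross-entropy on the language-based scores for the weak-supervision term, the `weight` parameter, and all function names.

```python
import math


def co_training_loss(lang_scores, user_scores):
    """Mean squared disagreement between the two learners (assumed form)."""
    pairs = list(zip(lang_scores, user_scores))
    return sum((l - u) ** 2 for l, u in pairs) / len(pairs)


def weak_supervision_loss(scores, weak_labels):
    """Cross-entropy on messages with a weak label; None means unlabeled."""
    terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
             for p, y in zip(scores, weak_labels) if y is not None]
    return sum(terms) / len(terms) if terms else 0.0


def objective(lang_scores, user_scores, weak_labels, weight=1.0):
    """Sum of the two losses; `weight` is a hypothetical trade-off knob."""
    return (co_training_loss(lang_scores, user_scores)
            + weight * weak_supervision_loss(lang_scores, weak_labels))


# Two messages: the first matched a key phrase (weak label 1), the second did not.
agree = objective([0.9, 0.2], [0.9, 0.2], [1, None])
disagree = objective([0.9, 0.2], [0.1, 0.2], [1, None])
```

In this toy setup, `disagree` is larger than `agree` only because the two learners assign different scores to the first message, which is the behaviour the co-training term is meant to penalize.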
We evaluate the proposed framework on data from Twitter, one of the social media platforms with frequent occurrence of cyberbullying. To assess the effectiveness of our model, we use post-hoc, crowdsourced annotation of detections by variations of our new model and a non-deep method we previously analyzed. We quantitatively demonstrate that our weakly supervised deep model improves precision over a non-deep version of the model in identifying cyberbullying incidents. These experiments show that our proposed framework, which combines weak supervision, an ensemble of two learners, and nonlinear deep models, performs better than a model that lacks any one of these three characteristics.
[1] K. Reynolds, A. Kontostathis, and L. Edwards, “Using machine learning to detect cyberbullying,” in Proceedings of the 10th International Conference on Machine Learning and Applications, 2011.
[2] V. Nahar, X. Li, C. Pang, "An Effective Approach for Cyberbullying Detection," Journal of Communications in Information Science and Management Engineering, 2013.
[3] Q. Huang, V.K. Singh, P.K. Atrey, “Cyber bullying detection using social and textual analysis,” Proceedings of the 3rd International Workshop on Socially-Aware Multimedia, 2014.
137 | Learning Detection with Diverse Proposals | Samaneh Azadi, UC Berkeley; Jiashi Feng, National University of Singapore; Trevor Darrell, UC Berkeley | To predict a set of diverse and informative proposals with enriched representations, we introduce a differentiable Determinantal Point Process (DPP) layer that can augment object detection architectures. Most modern object detection architectures, such as Faster R-CNN, learn to localize objects by minimizing deviations from the ground truth but ignore the correlation between multiple proposals and object categories. We propose a new loss layer, added alongside the softmax classifier and bounding-box regressor layers (all included in the multi-task loss for training the deep model), which formulates the discriminative contextual information as well as the mutual relation between boxes into a DPP loss function. This DPP loss finds a subset of diverse bounding boxes using the inputs of the other two loss functions (namely, the probability of each proposal belonging to each object category and the coordinates of the proposals) and reinforces them in finding more accurate object instances in the end. We employ our DPP loss to maximize the likelihood of an accurate selection given the pool of overlapping background and non-background boxes over multiple categories.
Moreover, inference in state-of-the-art detection methods is generally based on Non-Maximum Suppression (NMS), which considers only the overlap between candidate boxes per class label and ignores their semantic relationship, resulting in multi-labeled detections. We propose a DPP inference scheme to select a set of non-repetitive high-quality boxes per image, taking into account spatial layout, category-level analogy between proposals, and their quality score obtained from the trained deep model. We call our proposed end-to-end model “Learning Detection with Diverse Proposals Network – LDDP-Net”.
Our proposed loss function for representation enhancement and more accurate inference can be applied to any deep network architecture for object detection. In our experiments, we focus on the Faster R-CNN model to show the significant performance improvement added by our DPP model. We demonstrate the effect of our proposed DPP loss layer on accurate object localization on the benchmark detection datasets PASCAL VOC and MS COCO, based on average precision and average recall detection metrics. Our trainable DPP layer substantially improves the location and category specifications of the final detected bounding boxes during both training and inference, without increasing the number of parameters of the network. Furthermore, we show that LDDP keeps its superiority over Faster R-CNN even if the number of proposals generated by LDDP is only ∼30% of that for Faster R-CNN.
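For reference, the NMS baseline that this paragraph contrasts against can be sketched as below. This is the standard greedy per-class procedure, not the proposed DPP inference; the box format `(x1, y1, x2, y2)`, the 0.5 threshold and the sample boxes are assumptions for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The only signal used here is geometric overlap within one class, which is why two boxes with different labels on the same object both survive.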
140 | Knowledge Elicitation for Risk Assessment of Database Activity Monitoring with application for data sampling | Hagit Grushka-Cohen, Ben-Gurion University of the Negev; Bracha Shapira, Ben-Gurion University of the Negev; Lior Rokach, Ben-Gurion University of the Negev | Ranking and prioritization problems have been extensively studied and have many practical uses. In some problem domains, models cannot be transferred easily between similar sub-domains and obtaining tagged data is extremely expensive. One such domain is cyber security, where anomaly detection models may use the same features but do not transfer well between customers due to different usage, vulnerabilities and regulations. Another such domain is health care, where alerting on events requires fine-tuning to each hospital ward. In such domains, tuning the model requires adding tagged examples, but these have to be tagged by security experts or physicians, leading to low adoption rates. Applying ranking for anomaly detection in these domains requires overcoming a constant cold-start problem and providing explainability to end users in order to convince them to keep interacting with the system while it is improving.
In this paper we focus on a data set from the domain of Database Activity Monitoring (DAM), in which user database access is monitored for anomaly detection. Security systems for databases produce numerous alerts about anomalous activities and policy rule violations. Prioritizing these alerts will help security personnel focus their efforts on the most urgent ones. Currently, this is done manually by security experts who rank the alerts or define static risk scoring rules. Existing solutions are expensive, consume valuable expert time, and do not dynamically adapt to changes in policy. Another challenge in this domain is data sampling. Because of the high-velocity nature of database systems that process 100K transactions per second, such systems audit only a portion of the vast number of transactions that take place and cannot save all of them due to prohibitive cost. Sampling methods for high-velocity data have been studied in the domain of network packet monitoring (Jadidi et al, 2016) and rely on using different priors when sampling data based on the risk the data presents.
We present CyberRank, a novel algorithm that combines a preference elicitation approach with a supervised ranking model. Instead of tagging data, the expert user answers a short Analytic Hierarchy Process (AHP) preference questionnaire. Synthetic examples are generated using a simple generative model of the data; these examples are then annotated using the AHP model. The tagged data bootstrap the training of a preference learning algorithm such as Ranking SVM (Joachims, 2002). This approach also allows the model to be updated when new regulations are added or when an expert decides to change it, by changing the AHP weights and creating more synthetic data with the desired properties.
We evaluate different approaches with a new dataset of expert-ranked pairs of database transactions, ranked in terms of their risk to the organization. CyberRank outperforms baselines in the cold-start scenario with an error reduction of 20%.
Furthermore, we explore the use of risk scores produced by such a model to guide the sampling process for auditing and gathering data. We show that informing the sampling algorithm with the produced risk score improves recall significantly, especially for domains where the positive samples are extremely rare.
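The bootstrapping step described above can be sketched as follows, under stated assumptions. The pairwise comparison matrix, the three unnamed criteria, the uniform random generator for synthetic transactions and all function names are hypothetical; the row geometric-mean method is a common approximation of AHP priorities and may differ from what CyberRank uses. The resulting labeled pairs are the kind of input a pairwise ranker such as Ranking SVM would train on.

```python
import math
import random


def ahp_weights(pairwise):
    """Approximate AHP priorities with the row geometric-mean method."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def risk_score(features, weights):
    """Weighted sum of criterion values for one transaction."""
    return sum(f * w for f, w in zip(features, weights))


def synthetic_ranked_pairs(weights, count, seed=0):
    """Draw random feature vectors and label each pair by AHP risk score."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(count):
        a = [rng.random() for _ in weights]
        b = [rng.random() for _ in weights]
        label = 1 if risk_score(a, weights) > risk_score(b, weights) else -1
        pairs.append((a, b, label))
    return pairs


# Hypothetical expert answers: criterion 0 is twice as important as 1, etc.
pairwise = [[1.0, 2.0, 4.0],
            [0.5, 1.0, 2.0],
            [0.25, 0.5, 1.0]]
weights = ahp_weights(pairwise)
training_pairs = synthetic_ranked_pairs(weights, count=5)
```

Changing an entry of `pairwise` and regenerating `training_pairs` mirrors the update path the abstract describes for new regulations or expert decisions.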
146 | Sparse 3D Convolutional Networks for Efficient Object Classification | Ananya Gupta, The University of Manchester; Xiaofan Xu, Intel Corporation; Jonathan Byrne, Intel Corporation; David Moloney, Intel Corporation; Simon Watson, The University of Manchester; Hujun Yin, The University of Manchester | Deep Convolutional Neural Networks (DNN) provide state-of-the-art results in a number of tasks such as image classification and voice recognition. They are inherently memory and compute intensive and hence are not suitable for embedded devices. However, with the advent of DNN accelerator chips, the performance of the networks is limited by the external memory bandwidth for parameter fetching from the DRAM. The problem becomes even more pronounced when dealing with 3D data for tasks like object detection since the data is of higher dimensionality than image data.

3D data is often sparse in nature, and recent work has shown the efficacy of using the sparsity of the data to define how the convolution operation is performed. Inspired by work done on pruning the weights in 2D image classification networks such as AlexNet and VGGNet, we sparsify the weights and kernels for existing 3D object classification networks such as VoxNet. Our results show that 50% of the network weights can be pruned without loss in accuracy and that pruning up to 85% of the weights causes only a 5% reduction in accuracy. Furthermore, these networks are easier to prune than 2D networks, which we believe is due to the sparse nature of 3D data, allowing for sparse features.

In this work, we examine the use of a sparse representation for minimising the footprint of the sparsified weights. A combination of sparse input data and sparse weights will allow these networks to be used for real-time applications, such as robotics, by reducing the number of memory accesses on low-power embedded systems.
154 | A Framework for Jointly Learning Pharmacovigilance Tasks | Shaika Chowdhury*, University of Illinois at Chicago; Chenwei Zhang, University of Illinois at Chicago; Philip S. Yu, University of Illinois at Chicago | Social media have grown to be a crucial information source for pharmacovigilance studies, as an increasing number of people post previously unreported adverse reactions to medical drugs. Aiming to effectively monitor various aspects of Adverse Drug Reactions (ADRs) from diversely expressed social media posts, we propose a multi-task neural network framework that collectively learns several tasks associated with ADR monitoring with different levels of supervision. Besides being able to correctly classify ADR posts and accurately extract ADR mentions from online posts, the proposed framework is also able to further understand the reasons for which the drug is being taken, known as ‘indications’, from the given social media post. A coverage-based attention mechanism is adopted in our framework to help the model properly identify ‘phrasal’ ADRs and Indications that are attentive to multiple words in a post. Our proposed model is a variant of the basic recurrent neural network encoder-decoder in which all tasks share one encoder and each task uses a different decoder. Our assumption is that the shared encoder would learn predictive representations capturing the nuances of each task and, hence, help disambiguate an ‘ADR’ from an ‘Indication’. A task-specific decoder, in turn, decodes the shared encoder representation to produce task-specific output. Our framework is applicable in situations where limited parallel data for different pharmacovigilance tasks are available. We evaluate the proposed framework on real-world Twitter datasets, where the proposed model consistently outperforms the state-of-the-art alternatives for each individual task.
166 | Exploring Portability of Data Programming Paradigm | Nurzat Rakhmanberdieva, Saarland University; Thilo Boehm, SAP SE; Andreas Fritzler, SAP SE; Nurlanbek Duishoev, IQVIA; Dietrich Klakow, Saarland University | Machine learning methods, especially Deep Learning, have achieved enormous breakthroughs in natural language processing and computer vision. They have shown incredible performance in solving complex problems with minimal human interaction when a large amount of labeled data is available. The hardest part is labeling large quantities of unlabeled data, as it is time-consuming, expensive and requires expert knowledge. The Data Programming Paradigm [1], which was introduced at NIPS 2016, proposes a method that uses labeling functions. These are a set of heuristic rules that produce large but noisy training data, which is later denoised by a generative model of these labeling functions.
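A minimal sketch of the two ideas, pruning and compact storage, is given below. The abstract does not say which pruning criterion or storage format is used, so magnitude-based pruning, the (index, value) list format, the flat four-weight kernel and the function names are all assumptions for illustration.

```python
def magnitude_prune(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * fraction)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned


def to_sparse(weights):
    """Keep only non-zero weights as (index, value) pairs."""
    return [(i, w) for i, w in enumerate(weights) if w != 0.0]


def sparse_dot(sparse_weights, inputs):
    """Dot product that touches only the stored non-zero weights."""
    return sum(w * inputs[i] for i, w in sparse_weights)


kernel = [0.5, -0.1, 0.03, -0.8]       # hypothetical flattened kernel
pruned = magnitude_prune(kernel, 0.5)  # prune 50% of the weights
sparse = to_sparse(pruned)
response = sparse_dot(sparse, [1.0, 1.0, 1.0, 1.0])
```

Here half of the kernel is dropped and the dot product reads two weights instead of four, which is the kind of memory-access saving the abstract targets.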
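
To make the notion of labeling functions concrete, here is a small token-level sketch. Everything specific is hypothetical: the city-slot task, the word lists and the function names. Note also that a simple majority vote is swapped in for the generative denoising model that Data Programming actually uses.

```python
ABSTAIN, NEG, POS = 0, -1, 1
KNOWN_CITIES = {"boston", "denver"}
FILLER_WORDS = {"show", "me", "flights", "from", "to", "a", "the"}


def lf_gazetteer(tokens, i):
    """Vote POS if the token is in a small (hypothetical) city list."""
    return POS if tokens[i].lower() in KNOWN_CITIES else ABSTAIN


def lf_after_preposition(tokens, i):
    """Vote POS if the previous token is 'from' or 'to'."""
    return POS if i > 0 and tokens[i - 1].lower() in {"from", "to"} else ABSTAIN


def lf_filler_word(tokens, i):
    """Vote NEG for common non-slot words."""
    return NEG if tokens[i].lower() in FILLER_WORDS else ABSTAIN


LABELING_FUNCTIONS = [lf_gazetteer, lf_after_preposition, lf_filler_word]


def majority_vote(votes):
    """Stand-in for the generative model: the sign of the summed votes."""
    total = sum(votes)
    return POS if total > 0 else NEG if total < 0 else ABSTAIN


def label_tokens(tokens):
    return [majority_vote([lf(tokens, i) for lf in LABELING_FUNCTIONS])
            for i in range(len(tokens))]


tokens = "show me flights from Boston to Denver".split()
labels = label_tokens(tokens)
```

Each function may abstain, so coverage and agreement vary per token. The generative model in the paradigm estimates how much to trust each function instead of counting votes equally.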

In this thesis, we explored the portability of the Data Programming Paradigm to new domains. We applied it to sequence labeling, also known as Slot-filling for Spoken Language Understanding, and to Named Entity Extraction. First, to allow these tasks to be included as part of the pipeline, we modified the initial data processing and candidate generation steps in the model. Second, we introduced a new type of labeling function to test the hypothesis that "lightly" trained models can serve as a solid labeling function in combination with other functions. In this context, "lightly" trained models denote Deep Learning methods such as Convolutional and Recurrent Neural Networks that are trained with a small subset of data. Third, we described the strategies to implement and select optimal labeling functions. Finally, we showed that the Data Programming Paradigm can be successfully extended to such tasks and outperforms its counterparts on noisy data. The experimental results for Slot-filling showed that, on clean data, the Data Programming Paradigm achieved an F1 score 5.9 points better than the baseline, while on noisy data it outperformed counterparts such as Conditional Random Fields by a factor of two. We examined the model with benchmarks such as the Air Travel Information System and SAP-related datasets.

[1] Data Programming: Creating Large Training Sets, Quickly.
A. Ratner, C. De Sa, S. Wu, D. Selsam and C. Ré.
Advances in Neural Information Processing Systems 29 (NIPS 2016).
172 | A Mathematical Model for Multi-class and Multi-criteria Classification Problems | Loubna Benabbou, Ecole Mohammadia d'Ingenieurs, Mohammed V University in Rabat, Rabat, Morocco; Pascal Lang, Faculty of Business Administration, Laval University, Quebec, Canada | Multi-class and multi-criteria classification problems are more complex and uncertain than ever in different areas such as healthcare, finance, supply chain management and risk management. They arise when dealing with multiple classes and when data and the decision maker’s preferences are defined based on multiple criteria that may be conflicting and heterogeneous. In the literature, this problem is usually solved by a series of dichotomizations with outranking relations or utility functions. In our suggested approach, we address this problem by constructing a model based on statistical learning theory and mathematical programming. We start by building a rule-based classifier and then develop a mathematical program (a mixed linear program) which maximizes the classifier’s performance. This is achieved by minimizing a compression bound on the generalization risk (expected loss) of the classifier. We provide bounds in terms of the empirical risk and the parameters of the multi-class and multi-criteria classifier. We also demonstrate how those parameters provide effective guidelines for model selection. Furthermore, the proposed classifier provides a degree of membership to the classes and measures the relative severity of different types of classification errors with an asymmetric valued loss function. Experimental results highlight the trade-off between accuracy and complexity.
174 | Teaching Particle Physics to Generative Adversarial Networks with Attribute Control in 3D Simulation | Michela Paganini*, Yale University; Luke de Oliveira, Lawrence Berkeley National Laboratory; Benjamin Nachman, Lawrence Berkeley National Laboratory | Generative Adversarial Networks (GANs) have seen immense interest and success in recent years. However, most tasks have existed solely within the domain of natural images. In contrast, we provide an overview of modifications and tricks necessary to make GANs work on data from the physical sciences. The simulation of scientific datasets, often used as test beds for the development and evaluation of application and domain specific machine learning algorithms, is, in various disciplines such as High Energy Particle Physics and Atmospheric Science, a slow and complex, yet necessary, step in the workflow. This contribution focuses on the encoding of domain-specific constraints and prior knowledge into an adversarial training, and on the effects of traversing high-dimensional physical manifolds to investigate the space of GAN-generated images. We provide select examples of domain-specific thought processes with respect to improving GAN training procedures, aiming to be a resource for researchers in any physical or applied science wishing to apply GANs to a choice problem.
177 | An inferential procedure for community structure validation in networks | Luisa Cutillo, University of Naples Parthenope and University of Sheffield; Mirko Signorelli, Department of Medical Statistics and Bioinformatics of the Leiden University Medical Center | The growing availability of real world networks inspired the study of complex networks in the multidisciplinary fields of social, technological and biological networks. What makes networks so attractive? We are constantly dealing with networks. Supermarkets use networks of customers to make custom offers to targeted groups; banks orchestrate a complex system of money exchange between them and clients; terrorists are connected in networks all over the world; media networks dominate our lives and inside each living being genes express and co-regulate themselves via complex networks, even when we sleep. Networks constitute a mathematical model of a real problem, and understanding network structure is the key to unravelling the secret message in data.
'Community structure' is a commonly observed feature of real networks. The term refers to the presence in a network of groups of nodes (communities) that feature high internal connectivity and are poorly connected to each other. Whereas the issue of community detection has been addressed in several works, the problem of validating a partition of nodes as a good community structure for a network remains an open issue. We propose an inferential procedure for community structure validation of network partitions, which relies on concepts from network enrichment analysis [1]. We construct a set of community structure validation indices, relying on the hypothesis test NEAT [1]. The proposed procedure makes it possible to compare the validity of different partitions of nodes as community structures for a given network. Moreover, it can be employed to assess whether two networks share the same community structure, and to compare the performance of different network clustering algorithms. We show the application of our overall strategy to the set of 30 tissue-specific gene networks inferred in [2].
Keywords: community structure validation, networks, network enrichment
References [1] Signorelli, M., Vinciotti, V. and Wit, E. (2016). NEAT: an efficient network enrichment analysis test. BMC Bioinformatics, 17:352. [2] Gambardella et al. (2013). Differential network analysis for the identification of condition-specific pathway activity and regulation, Bioinformatics, 29:14, 1776-85.
181 | Let's Make Block Coordinate Descent Go Fast! | Julie Nutini, University of British Columbia; Issam Laradji, University of British Columbia; Mark Schmidt, University of British Columbia; Warren Hare, University of British Columbia (Okanagan) | Block coordinate descent (BCD) methods have become one of our key tools for solving some of the most important large-scale optimization problems. This is due to their typical ease of implementation, low memory requirements, cheap iteration costs, adaptability to distributed settings, ability to use problem structure, and numerical performance. Notably, they have been used for almost two decades in the context of L1-regularized least squares (LASSO) and support vector machines (SVMs). Indeed, randomized BCD methods have recently been used to solve instances of these widely-used models with a billion variables, while for "kernelized" SVMs, greedy BCD methods remain among the state of the art. Due to the wide use of these models, any improvements on the convergence of BCD methods will affect a myriad of applications.

Three main algorithmic choices influence the performance of BCD methods: the block partitioning strategy, the block selection rule, and the block update rule. In this work we explore all three of these building blocks and propose variations for each that can lead to significantly faster BCD methods. We propose new greedy block-selection strategies that guarantee more progress per iteration than the classic greedy Gauss-Southwell rule, and give a general result characterizing the convergence rate obtained under both the Polyak-Łojasiewicz inequality and for general (potentially non-convex) functions. We also explore block update strategies that exploit higher-order information to yield faster local convergence rates and explore the use of message-passing to efficiently compute optimal block updates for problems with a sparse dependency between variables. Further, we show that greedy BCD methods have a finite-time manifold identification property for problems with separable non-smooth structures like bound constraints or L1-regularization. This analysis notably leads to bounds on the number of iterations required to reach the optimal manifold, and under certain conditions leads to superlinear convergence. In the special case of LASSO and SVM problems we show that optimal updates are possible, leading to finite convergence for problems with sufficiently sparse solutions. We support our findings with numerical results.
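As a point of reference, the classic Gauss-Southwell rule that the proposed strategies improve on can be sketched with blocks of size one. The quadratic objective f(x) = 0.5 xᵀAx - bᵀx, the 2x2 matrix, the stopping tolerance and the function name are assumptions chosen for the example; the selection rule picks the coordinate with the largest gradient magnitude and the update minimizes f exactly along that coordinate.

```python
def gauss_southwell_cd(A, b, iters=100, tol=1e-12):
    """Greedy coordinate descent on f(x) = 0.5 x^T A x - b^T x (A assumed SPD)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        grad = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
        i = max(range(n), key=lambda k: abs(grad[k]))  # Gauss-Southwell selection
        if abs(grad[i]) < tol:
            break
        x[i] -= grad[i] / A[i][i]  # exact minimization along coordinate i
    return x


A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [1.0, 2.0]
solution = gauss_southwell_cd(A, b)
```

For this small system the iterates approach the solution of Ax = b, namely x = (1/11, 7/11). With larger blocks, the same loop would select a group of coordinates and solve a small subproblem instead of a scalar update.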
195 | Deep Learning for Cardiovascular Disease Risk in the China Kadoorie Biobank | Yanting Shen, University of Oxford; Yang Yang, Shanghai Jiao Tong University; Tingting Zhu, University of Oxford; Sarah Parish, University of Oxford; Zhengming Chen, University of Oxford; Robert Clarke, University of Oxford; David Clifton, University of Oxford | Deep Learning for Cardiovascular Disease Risk Evaluation in the China Kadoorie Biobank
Risk evaluation for cardiovascular disease (CVD) is important for enhancing population health and lowering healthcare expenditure. In the medical community, statistical models such as the Cox proportional hazards model are widely used in large cohort studies for risk evaluation. However, conventional epidemiology studies require the risk factors to be clinically meaningful, such as hypercholesterolemia, smoking and hypertension, and require their normal or pathological ranges to be defined. This makes it difficult to incorporate time-series measurements into the risk metric, as the feature extraction process is highly heuristic. We propose a deep learning model to automatically predict risk scores from electrocardiogram (ECG) time-series waveforms, without an explicit pre-processing or feature extraction step before the data are fed into the input layer of the deep neural network. The model takes the entire time-series waveform as a holistic high-dimensional risk factor.
Our model was developed using 10-second, 500 Hz, 12-lead ECG time-series from approximately 25,000 participants in the China Kadoorie Biobank (CKB). Our model contains three layers of one-dimensional convolutional neural network (1-D CNN), interlaced with Max-pooling layers and Dropout layers. Our model is able to predict 6 CVD classes with high accuracy (>90%) in balanced and independent test sets (Table 1).
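The building blocks named above can be illustrated on a single channel as follows. This is only a sketch of one convolution and pooling stage, not the reported architecture: the kernel values, the toy signal, the ReLU activation and the pool size are assumptions, and Dropout is omitted because it only acts during training.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation, as used in CNN layers."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def relu(values):
    """Assumed activation; the abstract does not name one."""
    return [max(0.0, v) for v in values]


def max_pool(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


# Toy waveform standing in for one ECG lead (hypothetical values).
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
kernel = [1.0, 0.0, -1.0]  # hypothetical learned filter

features = max_pool(relu(conv1d(signal, kernel)), size=2)
```

Stacking three such stages with several filters each, followed by a classifier over the pooled features, gives the general shape of a 1-D CNN that reads a raw waveform without hand-crafted features.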
We believe that this is the first study using deep learning for time-series analysis on a large cohort, especially with very short ECG recordings (10 seconds) and numerous disease classes. Our findings will lead to further evaluation of clinical use of our ECG risk scores and developing multi-type risk prediction models on the heterogeneous data (continuous non-time series, categorical, missing data) in the CKB.
(There is a table at the end of the original abstract which can not be inserted here.)
196ProjectHealth: Applying PGM for Project Success Calculation in the Delivery Based OrganizationOlga Tataryntseva, ELEKS Ltd.Leading a scalable delivery-based organization means tracking its services on a daily basis, making up-to-date decisions, preventing and fixing possible problems, etc. Nevertheless, the leader rarely has the opportunity to go into the details of each deliverable produced inside the organization. Tracking the level of business success and making decisions mostly depend on the reports and data retrieved from the sublevels of the organization. Drilling down into details, studying the information sources, and digging for the raw data is therefore laborious, especially if the business is constantly growing. Furthermore, senior management working on being transparent about the organization's goals could be interested in the ability to communicate these goals to company members automatically.
Being unaware of existing solutions for these tasks, we developed a solution named ProjectHealth which: (i) models the domain experts' understanding of a successful project-based business; (ii) allows leaders to reason based on actual data, with the ability to obtain the raw data with minimal effort; (iii) allows decision making that relies on weak signals for the purpose of risk minimization; (iv) creates an indirect communication flow between different levels of the organization.
To achieve these goals, a Bayesian network was constructed. The model generalizes the knowledge, experience, and expectations of the stakeholders from all departments of the delivery organization, gathered during brainstorming sessions and meetings. Its testing was divided into several steps: (i) interactive testing with the stakeholders by creating scenarios manually and playing them with automatic calculation of the inference value; (ii) launching real data collection for several projects, constantly checking the inference state and comparing it to the observed project status. Most improvements were made in the first step of testing: several nodes were added and deleted, connections were changed, and conditional probability distributions (CPDs) were corrected. Afterwards, the calculation of the success probability for each project inside the company was started.
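As an illustration of how a success probability can be read off a Bayesian network, here is a minimal sketch with two parent nodes and one child, evaluated by brute-force enumeration. The node names, network structure and all probabilities are invented for the example; they are not taken from the ProjectHealth model.

```python
from itertools import product

# Hypothetical priors and CPD; the real network and its values are not published here.
P_SCHEDULE_OK = 0.8
P_QUALITY_OK = 0.9
P_SUCCESS = {  # P(success | schedule_ok, quality_ok)
    (True, True): 0.95,
    (True, False): 0.50,
    (False, True): 0.40,
    (False, False): 0.05,
}


def joint(schedule, quality, success):
    """Joint probability of one full assignment of the three nodes."""
    p = P_SCHEDULE_OK if schedule else 1 - P_SCHEDULE_OK
    p *= P_QUALITY_OK if quality else 1 - P_QUALITY_OK
    p_s = P_SUCCESS[(schedule, quality)]
    return p * (p_s if success else 1 - p_s)


def prob_success(evidence=None):
    """P(success | evidence), summing the joint over all consistent worlds."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for s, q, ok in product([True, False], repeat=3):
        world = {"schedule": s, "quality": q, "success": ok}
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(s, q, ok)
        denominator += p
        if ok:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    print(prob_success())                     # prior probability of success
    print(prob_success({"schedule": False}))  # after observing a schedule problem
```

Observing a problem on one parent node lowers the inferred success probability, which is the kind of scenario the stakeholders could play through manually during interactive testing.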
The extract, transform and load (ETL) process retrieves information from multiple sources, e.g. the issue tracking system, team collaboration software, customer relationship management system, human resource management system, code repository, and financial management system. The model is built so that it needs no human input that relies on biased estimations or personal opinions. The ETL process and inference calculation are fully automated with Python scripts. The raw data and inference results are stored in a database, so the history is preserved.
The results of the inference calculation are shown on a custom BI dashboard. It reveals the probability of success for the delivery business vertical as a whole, for each delivery unit separately, and for each project in particular. The directed graph structure makes it possible to drill down through the inference results directly to the raw data.
ProjectHealth makes it clear to all levels of the organization whether any project has major problems with reaching the main project goals: schedule, quality, budget, resources, and scope. It reveals the cause of the risk or problem. Fixing it increases the probability of the project's success, which in turn affects the probability of the overall success of the delivery organization.
The size of the organization for which the model has been developed and successfully launched in production for permanent use is over 1000 employees.
200Resolution, Recommendation, and Explanation in Richly Structured Social NetworksPigi, Kouki, UCSC; Lise, Getoor, UCSCResolution, Recommendation, and Explanation in Richly Structured Social Networks
Online information today includes a variety of heterogeneous data networks such as social (e.g.
Facebook), media-sharing (e.g. YouTube), media-consumption (e.g. Netflix), and
information-sharing (e.g. Yelp). Such networks contain richly-structured social data that are immensely useful
in providing accurate recommendations to users. Utilizing such data requires the extraction of
knowledge both from their content (e.g., the description of the items) as well as their structure
(e.g., the friends of a user).
Although there are several challenges in such a setting, in our work we focus on the
following three challenges. First, given the size and heterogeneity of the data we may have items
that are coreferent, i.e., items appearing more than once with slight variations while referring to
the same physical entity. This problem is known as entity resolution (ER) and is defined as the
problem of identifying, matching, and merging references corresponding to the same entity within
a dataset. ER is commonly applied by the database community; in this work we propose to apply
it to the recommender systems domain. More specifically, we propose to perform ER over the set
of items in order to distinguish co-referent items that appear as different in the user-item
recommendation matrix but represent the same underlying entity. For example, two cameras
varying in their color may appear twice but may correspond to the exact same model. To perform
effective ER we propose to use a statistical relational learning framework, called probabilistic soft
logic, that allows us to model and collectively reason over the rich structure of the network. Once we
have found the coreferent items, we can use their ratings to increase the density of the user-item
matrix. By increasing the density, we expect to have more accurate recommendations.
Additionally, if we can infer that a newly-added item is co-referent with an item already having
ratings, we can use these ratings to address the cold-start problem.
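The density argument can be illustrated with a small sketch. It assumes the entity resolution step has already produced a mapping from item identifiers to canonical entities (the probabilistic soft logic model itself is not shown), and it averages ratings when one user rated several coreferent items, which is a choice made for this example only. All names and data are hypothetical.

```python
def merge_coreferent(ratings, canonical):
    """Collapse coreferent items into one column of the user-item matrix.

    ratings:   dict mapping (user, item) -> rating
    canonical: dict mapping item -> canonical entity id (assumed ER output)
    """
    sums = {}
    for (user, item), rating in ratings.items():
        key = (user, canonical.get(item, item))
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + rating, count + 1)
    # Average duplicate ratings of the same entity by the same user.
    return {key: total / count for key, (total, count) in sums.items()}


def density(ratings):
    """Fraction of filled cells in the user-item matrix."""
    users = {user for user, _ in ratings}
    items = {item for _, item in ratings}
    return len(ratings) / (len(users) * len(items))


if __name__ == "__main__":
    ratings = {
        ("ann", "camera_black"): 5,
        ("bob", "camera_red"): 4,
        ("ann", "lens"): 3,
    }
    canonical = {"camera_black": "camera", "camera_red": "camera"}
    merged = merge_coreferent(ratings, canonical)
    print(density(ratings), density(merged))
```

In this toy case the two camera variants collapse into one item, so the same three ratings fill a 2 x 2 matrix instead of a 2 x 3 one.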
As a second challenge, since the data may be coming from several heterogeneous data
sources, we need a way to provide recommendations over a wide range of information sources.
In previous works, hybrid recommender systems have been shown to provide superior
performance by combining multiple data modalities and modeling techniques within a single
model. In our recent work we propose such a hybrid recommender, that is extensible and can
make use of arbitrary data modalities to provide state-of-the-art performance. The hybrid
framework leverages the flexibility of probabilistic programming in order to incorporate and reason
over a wide range of information sources. Experimental studies on benchmark datasets showed
that our framework can effectively combine multiple information sources to improve
recommendations, achieving better performance than previous methods.
As a final challenge, although the performance of recommender systems has significantly
improved using hybrid methods like the one we proposed, most systems operate as black boxes,
i.e., they do not explain why a recommendation was made. Since users need meaningful
recommendations, recent work studies how to provide explanations. Typically, explanations from
a single recommendation algorithm come in a single style, e.g., a content-based recommender
system produces content-based explanations. However, for hybrid recommenders that combine
several data sources such as ratings, social networks, or demographics, we need to provide hybrid
explanations that combine all relevant sources of information and present explanations of more
than one style. Such explanations are both effective and desired by users. To the best of
our knowledge, there is no work that explains the recommendations provided by a hybrid
recommender engine. Here, we propose to leverage the output of the proposed hybrid framework
to provide hybrid explanations. We perform a mixed model statistical analysis of user preferences
for explanations in this system. Through an online user survey, we evaluate explanations for
hybrid algorithms in a variety of text and visual, graph-based formats, that are either novel designs
or derived from existing hybrid recommender systems.
204Learning Feature Engineering for ClassificationFatemeh Nargesian, Horst Samulowitz, Udayan Khurana, Tejaswini Pedapati, Deepak TuragaFeature engineering is the task of improving predictive modelling performance on a dataset by transforming its feature space. Existing approaches to automate this process rely on either transformed feature space exploration through evaluation-guided search, or explicit expansion of datasets with all transformed features followed by feature selection. Such approaches incur high computational costs in runtime and/or memory. We present a novel technique, called Learning Feature Engineering (LFE), for automating feature engineering in classification tasks. LFE is based on learning the effectiveness of applying a transformation (e.g., arithmetic or aggregate operators) on numerical features, from past feature engineering experiences. Given a new dataset, LFE recommends a set of useful transformations to be applied on features without relying on model evaluation or explicit feature expansion and selection. Using a collection of datasets, we train a set of neural networks, which aim at predicting the transformation that impacts classification performance positively. Our empirical results show that LFE outperforms other feature engineering approaches for an overwhelming majority (89%) of the datasets from various sources while incurring a substantially lower computational cost.
205Classifying Zebrafish Stripe Patterns using Topological Data Analysis and Multi-class Support Vector MachinesMelissa McGuirl, Brown University; Bjorn Sandstede, Brown UniversityZebrafish (Danio rerio) are characterized by their black and yellow stripes, which are made up of three main types of pigment cells: black melanophores, yellow xanthophores and silvery iridophores. Different types of cell mutations and changes in the strength of cell interactions can affect the width of the stripes, the distance between pigment cells, and the pattern of the stripes. In this ongoing research, we use existing mathematical models to generate zebrafish stripe patterns under various parameter regimes, corresponding to different pigment cell mutations and/or changes in cell interactions. The goal for this analysis is to use machine learning to automatically classify the mutation type and cell interactions of the zebrafish based on their stripe patterns. In this project’s first stage, our data consists of the coordinate information of the black melanophores and yellow xanthophores at the final stage of stripe development, whereas in this project’s second stage (work in progress) our data consists of the pigment cells’ coordinate information over a time series for the 3-week duration of stripe development. Utilizing methods from topological data analysis, we represent the stripe patterns by persistence diagrams. With this representation, the width of the cells is captured by the death time of the dimension 0 and dimension 1 features, the distance between pigment cells is captured by the birth time of the dimension 1 features, and the stripe patterns are captured by a combination of the dimension 0 and dimension 1 features. Thus, persistence diagrams expose the distinguishing characteristics that are necessary to classify the different stripe patterns.
In order to use topological representations as input for a machine learning algorithm, we then map the persistence diagrams into a finite-dimensional vector representation via persistence images. Lastly, we use a multi-class support vector machine to classify the mutation type and cell interactions based on the persistence image representations of the stripe data. The learning algorithm consists of n(n − 1)/2 binary support vector machines combined in a one-versus-one design, where n = 5 is the number of classes for our data, giving 10 binary classifiers. We use the built-in MATLAB function ‘fitcecoc’ in this step. Our method classifies 1450 different stripe patterns at the final stage of development into 5 stripe pattern classes with an accuracy of 99.9%.
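The one-versus-one design can be sketched in a few lines of Python: every unordered pair of classes gets its own binary classifier, and the final label is chosen by majority vote. The `decide` callback below is a hypothetical stand-in for a trained binary SVM; the sketch does not reproduce MATLAB's `fitcecoc`.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_pairs(classes):
    """All unordered class pairs; there are n(n - 1)/2 of them."""
    return list(combinations(classes, 2))


def predict(classes, decide):
    """Majority vote over all pairwise decisions.

    decide(a, b) is assumed to return whichever of the two classes the
    binary classifier for that pair prefers for the current sample.
    """
    winners = [decide(a, b) for a, b in one_vs_one_pairs(classes)]
    return Counter(winners).most_common(1)[0][0]


if __name__ == "__main__":
    classes = list(range(5))
    print(len(one_vs_one_pairs(classes)))  # number of binary classifiers for n = 5

    # Toy decision rule: prefer the class label closer to 3.
    def toy(a, b):
        return a if abs(a - 3) < abs(b - 3) else b

    print(predict(classes, toy))
```

For n = 5 this yields 10 pairwise classifiers, matching n(n − 1)/2.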
207DeepCoder: Semi-parametric Variational Autoencoders for Automatic Facial Action CodingLinh Tran, Imperial College London; Robert Walecki, Imperial College London; Ognjen (Oggi) Rudovic, MIT; Stefanos Eleftheriadis, Prowler.io; Björn Schuller, Imperial College London; Maja Pantic, Imperial College London(This work was accepted at ICCV 2017)
The human face exhibits an inherent hierarchy in its representations, from face shapes and poses to specific facial expressions. Facial expressions are typically described in terms of the configuration and intensity of facial muscle actions using the Facial Action Coding System (FACS). FACS defines a unique set of 30+ atomic non-overlapping facial muscle actions named Action Units (AUs), with rules for scoring their intensity on a six-point ordinal scale. Variational auto-encoders (VAE) have shown great results in unsupervised extraction of hierarchical latent representations and make a suitable approach for learning facial features for AU intensity estimation. Yet, most existing VAE-based methods apply classifiers learned separately from the encoded features. By contrast, non-parametric (probabilistic) approaches, such as Gaussian Processes (GPs), typically outperform their parametric counterparts, but cannot deal easily with large amounts of data.

To this end, we propose a novel variational semi-parametric modelling framework, named DeepCoder, which combines the modelling power of parametric (convolutional) and non-parametric (GP) VAEs, for joint learning of (1) latent representations at multiple levels in a task hierarchy, and (2) classification of multiple ordinal outputs. Specifically, DeepCoder is a general framework that builds upon a hierarchy of any number of VAEs, where each coding/decoding part of the intermediate VAEs interacts with the neighbouring VAEs during learning, assuring the sharing of information in both directions. We illustrate this approach by designing an instance of DeepCoder as a two-level semi-parametric VAE -- the top level being the parametric VAE, and the bottom level being a non-parametric Variational Ordinal Gaussian Process VAE. We show on benchmark datasets for Facial Action Unit intensity estimation that the proposed DeepCoder outperforms the state-of-the-art approaches, and related VAEs and deep learning models.
210Attend and Predict: Understanding Gene Regulation by Selective Attention on ChromatinRitambhara Singh; Jack Lanchantin; Arshdeep Sekhon; Yanjun Qi, University of VirginiaThe past decade has seen a revolution in genomic technologies that enabled a flood of genome-wide profiling of chromatin marks. Recent literature tried to understand gene regulation by predicting gene expression from large-scale chromatin measurements. Two fundamental challenges exist for such learning tasks: (1) genome-wide chromatin signals are spatially structured, high-dimensional and highly modular; and (2) the core aim is to understand what the relevant factors are and how they work together. Previous studies either failed to model complex dependencies among input signals or relied on separate feature analysis to explain the decisions. This paper presents an attention-based deep learning approach, AttentiveChrome, that uses a unified architecture to model and to interpret dependencies among chromatin factors for controlling gene regulation. AttentiveChrome uses a hierarchy of multiple Long Short-Term Memory (LSTM) modules to encode the input signals and to model how various chromatin marks cooperate automatically. AttentiveChrome trains two levels of attention jointly with the target prediction, enabling it to attend differentially to relevant marks and to locate important positions per mark.
We evaluate the model across 56 different cell types (tasks) in humans. Not only is the proposed architecture more accurate, but its attention scores provide a better interpretation than state-of-the-art feature visualization methods such as saliency maps.
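The idea of two attention levels can be illustrated with plain softmax weighting: one level weights positions within each chromatin mark, and a second level weights the marks themselves. In this minimal sketch the values are scalars and the attention scores are given as inputs; in the actual model they would be LSTM outputs and learned quantities. Function and variable names are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(values, scores):
    """Attention-weighted sum of values, plus the weights themselves."""
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values)), weights


def two_level_attention(bin_values, bin_scores, mark_scores):
    """Level 1: attend over positions per mark. Level 2: attend over marks."""
    summaries = [attend(v, s)[0] for v, s in zip(bin_values, bin_scores)]
    return attend(summaries, mark_scores)


if __name__ == "__main__":
    bin_values = [[1.0, 3.0], [2.0, 6.0]]  # two marks, two positions each
    bin_scores = [[0.0, 0.0], [0.0, 2.0]]  # second mark focuses on position 2
    output, mark_weights = two_level_attention(bin_values, bin_scores, [0.0, 1.0])
    print(output, mark_weights)
```

The returned weights are what makes such a model inspectable: they show which marks and positions contributed most to a prediction.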
211A new clustering method for building multiple supertrees using k-meansNadia TahiriThe problem of merging multiple different gene trees into a single species tree is a fundamental issue in computational biology. Frequently, gene trees carry important information about specific evolutionary patterns which characterize the evolution of the corresponding gene families. However, this information can be lost when merging multiple gene trees into a single species consensus tree. Supertrees are species consensus trees which are typically assembled from a set of smaller phylogenetic trees (i.e., additive trees or X-trees) that were inferred using different gene families or different datasets (e.g., morphological or molecular). These smaller trees usually have different, but mutually overlapping sets of labeled leaves representing species.

We present a new heuristic algorithm for partitioning a given set of phylogenetic trees into a few clusters each of which can be represented by its own supertree. A specific version of the popular k-means algorithm will be used to partition a given set of trees into one or several clusters. The Robinson and Foulds metric will be applied to measure the distance between trees. Two new cluster validity indices adapted for tree clustering will be used to determine the number of tree clusters (i.e., number of supertrees) in a dataset. One of the main advantages of our algorithm is that it is much faster than the existing tree clustering approaches. This makes it particularly well suited for the analysis of large genomic datasets.
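A minimal sketch of the distance computation follows. It makes simplifying assumptions that are not from the abstract: trees are rooted, written as nested tuples, and defined on the same set of leaves, so the Robinson and Foulds distance reduces to the number of clades found in one tree but not the other. Handling trees with only partially overlapping leaf sets, as needed for supertrees, is not covered.

```python
def clades(tree):
    """Set of non-trivial clades of a rooted tree given as nested tuples."""
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])  # a single leaf label
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    root = leaves(tree)
    found.discard(root)  # the full leaf set is shared by every tree
    return found


def rf_distance(tree_a, tree_b):
    """Robinson and Foulds distance: size of the symmetric clade difference."""
    return len(clades(tree_a) ^ clades(tree_b))


def nearest_center(tree, centers):
    """Index of the closest cluster representative (k-means assignment step)."""
    return min(range(len(centers)), key=lambda i: rf_distance(tree, centers[i]))


if __name__ == "__main__":
    t1 = ((("A", "B"), "C"), "D")
    t2 = ((("A", "C"), "B"), "D")
    print(rf_distance(t1, t2))
```

The `nearest_center` helper shows only the assignment step of k-means; how cluster representatives are updated is specific to the authors' algorithm and is not reproduced here.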
223Clustering of allergic response data through childhoodRebecca Howard, University of Manchester; Danielle Belgrave, Imperial College London; Panagiotis Papastamoulis, University of Manchester; Angela Simpson, University of Manchester and University Hospital of South Manchester; Magnus Rattray, University of Manchester; Adnan Custovic, Imperial College LondonThe temporal patterns of allergic sensitization can now be inspected at a greater resolution as a result of Component Resolved Diagnostics (CRD) (Treudler and Simon, 2013). We are using this data to group individuals with similar allergic sensitization trajectories. Our aim is to better understand co-sensitization patterns, relate sensitization patterns to allergy-related disease risk and inform the development of a more personalized approach to disease diagnosis, management and treatment.

Previous work has examined a particular immune response (component-specific IgE, sIgE) to a small subset of the available allergen components (15 components) across three time points (Custovic et al., 2015), and a higher-resolution set has been clustered for children at one time point (Simpson et al., 2015). Here, we scale up the analysis by considering the patterns of response to allergens from all available sources and across six time points from infancy to adolescence. We measured sIgE immune response in participants from a well-characterised population-based birth cohort at six follow-ups between the ages of 1 and 16 years (at ages 1, 3, 5, 8, 11 and 16). We used a Bernoulli mixture model with a Bayesian MCMC algorithm to learn clusters of sIgE components from binarized sensitization data, i.e. each cluster contains allergen-related components with a similar sensitization profile across the children. The model parameters and optimal number of clusters were inferred at each age. Model selection showed high confidence in the optimal number of clusters, with posterior probabilities for the optimal number >0.87 for the first five time points and >0.70 at age 16. Cluster membership was then inferred conditional on fixing the model order. The flow of allergens between clusters across time displays clear and consistent patterns in the data, and allergens cluster into increasingly specialised groups according to associated child responses. Though each age was clustered independently of the others, the clusters were biologically meaningful, had exceptionally high mean assignment probabilities, and displayed a high degree of consistency and stability across time points.
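To show what an assignment probability means in a Bernoulli mixture, the sketch below computes the posterior cluster membership of one binary sensitization profile for fixed mixture weights and Bernoulli parameters. It covers only this single building block; the Bayesian MCMC inference of the parameters and of the number of clusters used in the study is not shown, and all numbers are invented.

```python
import math


def bernoulli_log_lik(x, theta):
    """Log-likelihood of a binary vector x under independent Bernoulli(theta)."""
    return sum(math.log(t if xi else 1.0 - t) for xi, t in zip(x, theta))


def assignment_probabilities(x, weights, thetas):
    """Posterior probability of each cluster for x, given fixed parameters.

    Each theta value is assumed to lie strictly between 0 and 1.
    """
    logs = [math.log(w) + bernoulli_log_lik(x, th) for w, th in zip(weights, thetas)]
    peak = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    profile = [1, 1, 0]                # hypothetical binarized responses
    weights = [0.5, 0.5]               # hypothetical mixture weights
    thetas = [[0.9, 0.9, 0.1], [0.1, 0.1, 0.9]]
    print(assignment_probabilities(profile, weights, thetas))
```

A profile that closely matches one cluster's parameters receives an assignment probability near 1 for that cluster, which is the sense in which mean assignment probabilities can be called high.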

The cluster-based sensitization profiles of participants across these ages were then related to asthma and hay fever variables at age 16. When subject responses are stratified appropriately (taking into account the heterogeneous nature of both the subjects and the diseases themselves), the allergic response at age 5 can be strongly associated with the development of asthma and hay fever at age 16. We identified combinations of cluster, time point and degree of cluster sensitization that were clearly linked to an increased risk of asthma and hay fever development, as well as putative "lead" components (e.g. Fel d 1, from cat). Further application of this Bayesian clustering approach to similar data, and the continued exploration of the resulting clusters ought to facilitate the development of better diagnostic and prognostic biomarkers for allergic diseases.

Howard R, Belgrave D, Papastamoulis P, Simpson A, Rattray M, Custovic A. Evolution of IgE responses to multiple allergen components throughout childhood. J Allergy Clin Immunol (accepted)


Custovic A, Sonntag HJ, Buchan IE, Belgrave D, Simpson A, Prosperi MC. Evolution pathways of IgE responses to grass and mite allergens throughout childhood. J Allergy Clin Immunol 2015; 136(6): 1645-52 e1-8.

Simpson A, Lazic N, Belgrave DC, et al. Patterns of IgE responses to multiple allergen components and clinical symptoms at age 11 years. J Allergy Clin Immunol 2015; 136(5): 1224-31.

Treudler R, Simon JC. Overview of component resolved diagnostics. Curr Allergy Asthma Rep 2013; 13(1): 110-7.
225Gaussian Process Based Mapping of Environmental Parameters with Inaccurate Location InformationShuyu Lin*, University of Oxford; Niki Trigoni, University of Oxford; Steven Reece, University of OxfordAs mobile sensing systems are becoming ubiquitous, they are enabling a wide range of applications including condition monitoring, hazard tracking and environmental monitoring. Although device tracking systems have gained significant maturity, they are still prone to errors arising from noisy inertial sensors, lack of GPS data indoors or in Manhattan-type urban landscapes outdoors, unreliable radio measurements and so on. Tracking errors have a significant knock-on effect on spatial maps of environmental parameters, e.g. acoustic noise, temperature, humidity and so on. In this work, we propose a novel approach to processing location and environment sensor data generated by multiple mobile devices roaming through an area. We want to achieve two goals simultaneously as the data arrives in real time: 1) to evaluate the accuracy of location information; 2) to update the spatial distribution of the physical property of interest as soon as a new observation becomes available. We derive an algorithm by implementing a Gaussian Process (GP) inference model containing a classification step that labels trajectories as accurate or inaccurate and a constrained optimisation to estimate the appropriate level of noise caused by the location errors. To test this algorithm, we apply it to a synthetic scenario in which various modes of location error and additive white noise in the function measurements are introduced.
237Manipulating and Measuring Model InterpretabilityForough Poursabzi-Sangdeh, University of Colorado Boulder; Daniel G. Goldstein, Microsoft Research; Jake M. Hofman, Microsoft Research; Jennifer Wortman Vaughan, Microsoft Research; Hanna Wallach, Microsoft ResearchMachine learning models are often treated as black boxes and evaluated only in terms of their performance (e.g., accuracy) on held-out data sets. However, good performance on a held-out data set is seldom sufficient for people to trust a model and deploy it to make real-world decisions, and there is widespread consensus that people’s failure to understand a model can be problematic [1, 2]. In response to these concerns, there is a new line of research that focuses on developing interpretable methods for machine learning, either by developing new models that are inherently simple to understand [3] or by providing explanations or interpretations of existing complex models [4, 5, 6]. Despite the popularity of this line of research, there is no clear, agreed-upon definition of interpretability. Defining and quantifying interpretability therefore remains an open question.
Through large-scale randomized experiments, we vary factors that should make models more or less interpretable and, in turn, measure how these changes impact people’s decision making. We ask each participant to predict apartment prices in New York City, with the help of a model (linear regression). We show participants models that receive the same inputs and produce the same outputs, manipulating only the presentation of the models. We vary the number of features (two vs. eight) and the visibility of the model internals (clear vs. black box) in a 2 × 2 between-subject study. As a baseline, we also ask participants to predict apartment prices without the help of a model. For each experimental condition, we show participants a set of apartments, the model’s price predictions for those apartments, and the apartments’ sale prices. Next, we show participants a new set of apartments. For each one, we ask them to guess the model’s prediction. We then show them the model’s prediction and ask them to guess the sale price. Drawing on previous work [7, 8], we measure three different proxies for interpretability: 1) Simulation error (the participant’s error in guessing the model’s prediction); 2) trust (the participant’s confidence that the model has made the right prediction); and 3) prediction error (the participant’s error in guessing the sale price).
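The two error-based proxies can be computed as in the following sketch. It assumes mean absolute error as the error measure, which the abstract does not specify, and it leaves out the trust proxy because its measurement is not described in enough detail here. All names and numbers are illustrative.

```python
def mean_abs_error(guesses, targets):
    """Average absolute difference between paired guesses and targets."""
    return sum(abs(g - t) for g, t in zip(guesses, targets)) / len(targets)


def interpretability_proxies(guessed_model, model_pred, guessed_price, sale_price):
    """Simulation error and prediction error for one participant.

    guessed_model: participant's guesses of the model's predictions
    model_pred:    the model's actual predictions
    guessed_price: participant's guesses of the sale prices
    sale_price:    the actual sale prices
    """
    return {
        "simulation_error": mean_abs_error(guessed_model, model_pred),
        "prediction_error": mean_abs_error(guessed_price, sale_price),
    }


if __name__ == "__main__":
    # Hypothetical prices in millions of dollars for two apartments.
    print(interpretability_proxies([1.0, 2.0], [1.5, 1.0], [2.0, 2.0], [1.0, 4.0]))
```

Averaging these per-participant values within each of the four conditions would give the quantities compared across conditions.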
Our preliminary results indicate that, on average, participants in the two-feature, clear-model-internals experimental condition have lower simulation error. Interestingly, participants in the eight-feature, black-box-model-internals experimental condition do as well as participants in the eight-feature, clear-model-internals experimental condition. This result suggests that the number of features affects model interpretability. Despite these differences in simulation error, we find that participants’ prediction error is comparable across all four experimental conditions—i.e., participants appear to trust the models similarly.
We see this as the first of many possible experiments to guide the development of interpretable machine learning methods.
1. Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1721–1730. ACM, 2015.
2. Been Kim. Interactive and interpretable machine learning models for human machine collaboration. PhD thesis, Massachusetts Institute of Technology, 2015.
3. Jongbin Jung, Connor Concannon, Ravi Shroff, Sharad Goel, and Daniel G. Goldstein. Simple rules for complex decisions. arXiv preprint arXiv:1702.04690, 2017.
4. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
5. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Nothing else matters: Model-agnostic explanations by identifying prediction invariance. arXiv preprint arXiv:1611.05817, 2016.
6. Brian Y Lim, Anind K Dey, and Daniel Avrahami. Why and why not explanations improve the intelligibility of context-aware intelligent systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2119–2128. ACM, 2009.
7. Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
8. Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. 2017.
244Sparse One-Time Grab Sampling of InliersMaryam, Jaberi, University of Central Florida; Marianna, Pensky, University of Central Florida; Hassan, Foroosh, University of Central FloridaEstimating structures in "big data" and clustering them are among the most fundamental problems in computer vision, pattern recognition, data mining, and many other research fields. Over the past few decades, many studies have been conducted focusing on different aspects of these problems. One of the main approaches explored in the literature to tackle the problems of size and dimensionality is sampling subsets of the data in order to estimate the characteristics of the whole population, e.g. estimating the underlying clusters or structures in the data. In this paper, we propose a “one-time-grab” sampling algorithm. This method can be used as the front end to any supervised or unsupervised clustering method. Rather than focusing on the strategy of maximizing the probability of sampling inliers, our goal is to minimize the number of samples needed to instantiate all underlying model instances. More specifically, our goal is to answer the following question: “Given a very large population of points with C embedded structures and gross outliers, what is the minimum number of points r to be selected randomly in one grab in order to make sure with probability P that at least ε points are selected on each structure, where ε is the number of degrees of freedom of each structure.” This problem can be modeled using the hypergeometric pmf. In this paper, we study this model and show the accuracy of the methods in choosing the sample size.
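The question posed above can be explored numerically with the hypergeometric distribution. The sketch below computes, for one structure, the probability that a single grab of r points contains at least k points from it, and then searches for the smallest r that satisfies all structures at once. Combining structures with a union bound is a conservative simplification assumed for this example; it is not the authors' derivation. The names `k`, `sizes` and `min_sample_size` are illustrative.

```python
from math import comb


def prob_at_least(N, K, r, k):
    """P(X >= k) for X ~ Hypergeometric(population N, K marked, r drawn)."""
    total = comb(N, r)
    # Sum the pmf over the outcomes with fewer than k marked points.
    below = sum(comb(K, j) * comb(N - K, r - j) for j in range(min(k, r + 1)))
    return 1 - below / total


def min_sample_size(N, sizes, k, P):
    """Smallest r so that every structure gets >= k points with probability >= P.

    Uses a union bound over structures (an assumption of this sketch):
    the per-structure failure probabilities must sum to at most 1 - P.
    """
    for r in range(k * len(sizes), N + 1):
        miss = sum(1 - prob_at_least(N, K, r, k) for K in sizes)
        if miss <= 1 - P:
            return r
    return None


if __name__ == "__main__":
    # Hypothetical population: 1000 points, three structures, the rest outliers.
    print(min_sample_size(N=1000, sizes=[200, 150, 100], k=4, P=0.99))
```

Because the union bound overestimates the failure probability, the returned r is an upper bound on the true minimum under these assumptions.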
299Unsupervised state representation learning with robotic priors: a robustness benchmarkNatalia Diaz Rodriguez*, ENSTA ParisTech; Mathieu Seurin, ENSTA ParisTech; Timothee Lesort, ENSTA ParisTech; Xinrui Li, ENSTA ParisTech; David Filliat, ENSTA ParisTechOur understanding of the world depends highly on our capacity to produce intuitive and simplified representations which can be easily used to solve problems. We reproduce this simplification process using a neural network to build a simple low dimensional state representation of the world from images acquired by a robot. As proposed in Jonschkowski et al. 2015, we train a neural network in an unsupervised way using prior knowledge about the world as loss functions called robotic priors. We extend this approach to higher-dimensional, richer images to learn a 3D representation of the hand position of a robot from RGB images in a Reinforcement Learning (RL) setting for the task of pushing a button. We propose a quantitative evaluation of the learned representation using nearest neighbors in the state space that allows us to assess its quality and show both the potential and limitations of robotic priors in realistic environments. We add varied distractors and domain randomization, all crucial components to achieve transfer learning in robotics control. Finally, we also contribute a new prior to tackle the vulnerabilities found in our benchmark. It aims at solving the identified cases where partial degradation of the learned state representation space takes place, and it is based on an alignment reference-point. The ultimate objective of such a low dimensional state representation ranges from easing RL and knowledge transfer across tasks, to facilitating learning from raw data with more efficient and compact high level representations.
306Gradient-free Policy Architecture Search and AdaptationSayna Ebrahimi, UC Berkeley; Anna Rohrbach, MPII; Trevor Darrell, UC Berkeley We develop a method for policy architecture search and adaptation via gradient-free optimization which can learn to perform autonomous driving tasks. By learning from both demonstration and environmental reward we develop a model that can learn with relatively few early catastrophic failures. We first learn an architecture of appropriate complexity to perceive aspects of world state relevant to the expert demonstration, and then mitigate the effect of domain-shift during deployment by adapting a policy demonstrated in a source domain to rewards obtained in a target environment. We show that our approach allows safer learning than baseline methods, offering a reduced cumulative crash metric over the agent's lifetime as it learns to drive in a realistic simulated environment.
333Emergent Communication in a Multi-Modal, Multi-Step Referential GameKatrina Evtimova, NYU; Andrew Drozdov, NYU; Douwe Kiela, FAIR; Kyunghyun Cho, NYUInspired by previous work on emergent communication in referential games, we propose a novel multi-modal, multi-step referential game, where the sender and receiver have access to distinct modalities of an object, and their information exchange is bidirectional and of arbitrary duration. The multi-modal multi-step setting allows agents to develop an internal communication significantly closer to natural language, in that they share a single set of messages, and that the length of the conversation may vary according to the difficulty of the task. We examine these properties empirically using a dataset consisting of images and textual descriptions of mammals, where the agents are tasked with identifying the correct object. Our experiments indicate that a robust and efficient communication protocol emerges, where gradual information exchange informs better predictions and higher communication bandwidth improves generalization.
334Gabor-based Letter RecognitionCarolyn Murray, Northwestern University; Klinton Bicknell, Northwestern UniversityOne major component of the visual comprehension of linguistic units is the determination of letter identity and order. Recent models for this component have yet to incorporate knowledge about the neurobiological mechanisms of shape/form recognition. There are widely-used techniques in computer vision that effectively model human-like object recognition by extracting higher-level features of an image. One of these techniques is Gabor-based wavelet convolutions which mimic the responsivity of simple cells in the primary visual cortex. Current models for letter recognition either code for perceptual similarity of letters or attempt to account for effects of letter position (such as transposition effects), but not both. Here, we propose a Gabor-based model of letter recognition that simultaneously models perceptual similarity between letters and letter position, in a way that fits within the neurobiological understanding of this mechanism.

We compare these Gabor-based representations with pixel-based representations as a baseline. For the Gabor model, an array of Gabor wavelets is convolved with an image of a token (letter or sequence of letters) and the resulting responses serve as a representation of token identity. In the pixel model, the grayscale values at each pixel of the letter stimuli serve as measures of objective visual similarity. Within each model, recognition was simulated by adding Gaussian noise to the token representations and performing Bayesian inference to infer the distribution over tokens. We correlated these posterior distributions over tokens with human behavioral letter confusion rates to evaluate performance. Results showed that letter recognition models based on Gabor wavelets outperform models based on pixel representation in producing likelihoods of letter confusion. This is the case at all levels of perceptual “noise” and over four behavioral datasets. Also, correspondence between the rates of letter confusion in model and behavioral data required matching of font and case of the stimuli, suggesting that both the models and humans are sensitive to subtle visual changes in letter shape.
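The recognition simulation described above (Gaussian noise added to a token representation, followed by Bayesian inference over tokens) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the isotropic noise model, the uniform prior and the number of trials are all assumptions, and `token_reps` stands in for either the Gabor responses or the pixel values.

```python
import numpy as np

def token_posterior(observation, token_reps, sigma, prior=None):
    """Posterior over tokens for one noisy observation.

    Assumes observation = representation + isotropic Gaussian noise with
    standard deviation `sigma` (an assumption made for this sketch).
    """
    n_tokens = token_reps.shape[0]
    if prior is None:
        prior = np.full(n_tokens, 1.0 / n_tokens)  # uniform prior (assumed)
    # Gaussian log-likelihood up to a constant shared by all tokens.
    sq_dist = np.sum((token_reps - observation) ** 2, axis=1)
    log_post = np.log(prior) - sq_dist / (2.0 * sigma ** 2)
    log_post -= log_post.max()  # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

def simulate_confusions(token_reps, sigma, n_trials=200, seed=0):
    """Average posterior per true token; row i approximates P(report j | shown i)."""
    rng = np.random.default_rng(seed)
    n_tokens, n_dims = token_reps.shape
    confusion = np.zeros((n_tokens, n_tokens))
    for true_idx in range(n_tokens):
        for _ in range(n_trials):
            noisy = token_reps[true_idx] + rng.normal(0.0, sigma, size=n_dims)
            confusion[true_idx] += token_posterior(noisy, token_reps, sigma)
        confusion[true_idx] /= n_trials
    return confusion
```

The resulting matrix could then be correlated with a behavioral letter confusion matrix; the level of perceptual noise is controlled by `sigma`.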

These results suggest that the Gabor-based model partially re-instantiates a biological mechanism and mimics behavioral results of human letter recognition. Applying this technique to more general contexts could mechanize a reader’s tolerance to departures in letter order in word recognition tasks. By scaling the Gabor wavelets to larger sizes, this technique could capture lexical orthographic effects of word-level representations. Further, this technique can be incorporated into models of eye movements in reading to further understand how visual information is collected during reading. This finding is another example of the utility of implementing simplified neurobiological mechanisms in solving machine perception problems.
338Discovering Words and Objects from Speech and ImagesDavid Harwath, MIT; Galen Chuang, Wellesley/MIT; James Glass, MITDiscovering linguistic meaning and structure from a continuous speech signal is a well-established problem in speech processing and computational linguistics. Our study explores neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that audio-visual associative localization maps emerge from network-internal representations learned as a by-product of training to perform semantic retrieval. Our models operate directly on speech audio signals, and do not utilize any form of supervised speech recognition or transcriptions.
345Music Instrument Detection Using LSTMs and the NSynth DatasetBrandi Frisbie, Center for Computer Research in Music and Acoustics -- Stanford UniversityMusic instrument recognition is an important part of music information retrieval. Instrument detection could lead to better music tagging in recommendation systems, create better scoring for automatic music transcription tools, or even predict instrumentation of a piece of music based on a small sample. Most music instrument recognition research to date uses sound separation or other classification techniques. The recent release of the NSynth dataset, however, potentially constitutes a major breakthrough that can help mature the instrument recognition field. In this paper, I propose a framework for music instrument detection using LSTMs and the NSynth dataset. The NSynth dataset (over 280,000 samples from eleven different instruments) is used as training data and IRMAS (over 2,800 excerpts) is used for testing. A dataset comprising all possible notes of an instrument might allow a model to learn timbre. The model could be applied to a piece of music to identify the instruments present at a specific point in time. To my knowledge this approach has not been feasible before the release of NSynth and thus has not been attempted previously.

For the model implementation, an LSTM is used because one goal of this study is to identify instruments in a piece of music in sequence or at a specific time in the song. LSTMs have outperformed other models in pattern recognition applications for speech, have been used to model polyphonic music with expressive timing and dynamics, and have high potential for identifying the sequence of instruments in music. My implementation is similar to the Magenta Polyphony RNN model, where the Adam optimizer is used to optimize the model parameters. After careful evaluation of the model, future work will involve expanding the datasets to include more instruments and further tuning the model to best fit the data.

Note: This study appeared as a Late-Breaking Demo Session at the 18th International Society for Music Information Retrieval Conference, Suzhou, China.
349Improving transfer using augmented feedback in Progressive Neural NetworksDeepika Bablani, Carnegie Mellon University; Parth Chadha, Carnegie Mellon UniversityLearning faster on a task by utilizing learned representations from previous similar tasks is an active area of research in reinforcement learning. Recently proposed progressive neural networks demonstrate this effectively. We use motivations from reciprocal feedback connections in the visual cortex to augment lateral connections in the progressive neural network architecture. We evaluate our modified architecture on Atari games and show that it improves transfer over the progressive baseline.
354Image Segmentation using Fuzzy c-means algorithmDeeptha, Girish, University of Cincinnati; Vineeta, Singh, University of Cincinnati; Anca, Ralescu, University of Cincinnati;Image segmentation is one of the applications of clustering: pixels that are similar are clustered together
to form regions with specific characteristics. These regions usually correspond to objects in an image.
Image segmentation is most commonly based on texture or color.
Fuzzy c-means (FCM) is a popular clustering algorithm which views clustering as obtaining a fuzzy
partition of the universe of discourse. The effect is that a single data point may belong to more than one
class with a degree in [0, 1], and the sum of its degrees across the partition is equal to 1. Fuzzy c-means
clustering, first introduced in [1] as an optimization problem, has been widely used in image segmentation.
However, one of its drawbacks is that, unlike in its crisp counterpart, the k-means algorithm,
where the cluster membership (0 or 1) depends on the distance of the data point to the cluster centroid, in
fuzzy c-means the membership degree may not reflect the distance to the cluster centroid.
Several authors have experimented with different membership functions, objective functions, and features
in fuzzy c-means.
In this study, two novel methods of updating the degrees of membership are investigated. They take into
account the goodness of clustering and spatial relationships, respectively. The resulting clustering
algorithms are applied to segmentation of different kinds of images.
The membership degree adaptation step of FCM is instrumental because cluster assignment is made
based on the membership degrees, and therefore this step plays an essential role in the overall clustering
performance. In the first method, the membership degree adaptation is based on the distance of the point from the cluster
centroid, in a way which accords with intuition (the closer a point is to a cluster centroid, the higher its
degree of membership to that cluster), but is also “supervised” by the silhouette of the data point with respect
to that cluster.
The second approach to adapting the degree of membership investigated in this paper is inspired by image
processing and incorporates spatial information, as the neighborhood of a pixel is very important and
can provide valuable information regarding clustering. For example, sudden pixel value changes in a
neighborhood indicate edges, which suggests that the pixel belongs to a different cluster.
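For reference, the standard FCM update that the two proposed variants modify can be sketched as below. This shows only the textbook baseline (memberships in [0, 1] that sum to 1 for each point), not the silhouette-supervised or spatial updates investigated in the study; the function names and the fuzzifier default `m=2.0` are assumptions made for illustration.

```python
import numpy as np

def fcm_memberships(points, centroids, m=2.0, eps=1e-12):
    """Standard FCM membership update.

    points:    array of shape (n_points, n_features)
    centroids: array of shape (n_clusters, n_features)
    Returns an (n_points, n_clusters) array whose rows sum to 1.
    """
    # dist[i, j] = Euclidean distance from point i to centroid j
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    dist = np.maximum(dist, eps)  # avoid division by zero at a centroid
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fcm_centroids(points, memberships, m=2.0):
    """Standard FCM centroid update: membership-weighted mean of the points."""
    weights = memberships ** m  # shape (n_points, n_clusters)
    return (weights.T @ points) / weights.sum(axis=0)[:, None]
```

For image segmentation, `points` would hold one row per pixel (for example its gray level or color), and alternating the two updates until the memberships stop changing gives the usual FCM iteration.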
357Modeling multi-layered behavior of elderly people in smart homesNegar Ghourchian, Aerial Technologies; Michel Allegue Martinez, Aerial TechnologiesRecent progress in wireless technologies, coupled with state-of-the-art artificial intelligence techniques, has created a new era in context-aware computing. We are interested in a recent research area around a new generation of device-free smart home technologies, which passively sense, monitor, and track people’s indoor location, presence and movement using off-the-shelf wifi-enabled devices, such as those found in homes nowadays. Smart activity recognition, whose goal is to discover insightful patterns in human data, has a broad variety of applications, e.g. assisted living, elderly care monitoring and security. In this work we present an unsupervised approach for modeling smart home data to promote independence, well-being and quality of life for older people, especially seniors who live alone in their own home. The existing alternatives for addressing the healthcare requirements of elderly people often rely on camera-based monitoring, wearable devices, or heavy, complex deployments of ambient sensors. Vision-based approaches raise serious privacy concerns when it comes to constantly monitoring people’s personal lives. Moreover, heavy deployment of ambient sensors and wearable devices leads to practical problems including the need for cooperation from subjects, privacy concerns, and high implementation/maintenance cost. For instance, for elderly people or patients with limited mobility, wearing/carrying an external device 24/7 is uncomfortable and infeasible. On the other hand, most existing work in this field adopts supervised setups, where a limited amount of labeled data is gathered while users are asked to perform predefined and staged actions.
In this work, we present a multi-layered unsupervised modeling approach for learning complex aspects of real-life human activities, routines, and preferences. The proposed algorithm is designed and implemented in real homes for real subjects, and the data is collected with the device-free sensing infrastructure. We employ Bayesian nonparametric models for discovering hidden structures, i.e. clusters, in a collection of complex activities. Inspired by the topic modeling problem, where each topic is defined as a probability distribution over words while documents are viewed as distributions across topics, we consider each complex activity as a mixture of actions, locations and temporal characteristics. For example, the activity of “taking a shower” can contain several segments, e.g. “walking, standing in bathroom in morning”. The main component of our working dataset is wifi physical layer statistics, which characterize all distortions of wifi signal propagation in indoor environments and are shown to reflect human movements and activities, as well as the location of these events. Therefore, location and basic activity clusters/topics are built from the same signal, during different learning processes. Activities are shown to be highly correlated with specific locations at home. Moreover, temporal dimensions can significantly enhance the activity recognition process, and are introduced to the model as prior knowledge over the distribution of the discovered topics. Room-level indoor localization is performed using Latent Dirichlet Allocation (LDA), a generative statistical model that needs to know the number of clusters, on the assumption that in each residential unit the number of rooms is fixed and known a priori. For basic activity recognition, a Hierarchical Dirichlet Process (HDP) is employed, which allows the data to determine the complexity of the model, i.e. the number of activity clusters, as new, unseen activities appear in the data.
The clusters generated by the previous layers (locations and basic activities) are the building blocks of the next layer: the discovery of complex activities (event topics), which helps to distinguish routines from abnormal events. The last layer of the learning process is an online version of HDP that learns the complex event topics in real time by considering the sequence and duration of the features (basic activities and locations), besides their frequency of occurrence and mixing proportions.
The main initiatives for our elderly monitoring application include 1) recognizing granular activities such as toileting and showering, 2) detecting and modeling daily routines and abnormal activities (e.g., fall detection/prediction), 3) evaluating sleep quality using motion detection and breathing rate monitoring, and 4) providing a smart interactive system for family members through user interfaces.
367Semantic Seq2Seq Model for Generative Question-Answering SystemXinyu, Liu, Singapore University of Technology and Design (*This work is completed during an internship attachment with SAP Innovation Center Singapore.) ; Mohammad, Akbari, SAP Innovation Center Singapore The exponential growth of information on Community-based Question Answering (CQA) sites has raised challenges in providing high-quality answers for the given questions accurately. Although retrieval-based approaches are prevailing in Question-Answering (QA) systems, they are heavily bounded by the breadth of databases, which are costly to collect and maintain in terms of both time and memory. Recently, generative QA systems have demonstrated their effectiveness in modeling the semantic context of questions and answers while handling the ambiguity of queries and sparsity of data. However, existing approaches fail when faced with user expectation of more human-like interaction and question answering in multi-turn conversations. To tackle these challenges, in this paper, we propose a generative Semantic Sequence-to-Sequence Model, aiming to understand queries posted in human language and communicate expert knowledge as a conversational system.
Our proposed method splits the task into three stages: embedding pretraining, semantic matching and answer translation. Firstly, a dictionary connecting words and vectors is constructed based on the Facebook FastText word-to-vector algorithm (Joulin et al., 2016). Each given sentence is embedded into a vector in R^(n×m) space, where n denotes the maximum number of words to consider and m represents the length of the pretrained word vectors. In our experiment, 15 and 100 are the chosen maximum sentence length and word dimension. Subsequently, the embeddings are processed by a single LSTM (Long Short-Term Memory) network with 128 hidden units to produce a one-dimensional representation for each sentence. There are two separately trained LSTM cells of the same configuration for queries and answers, labelled the query encoder and the answer encoder, respectively. A semantic space is built up by maximizing the exponentiated similarity function among all paired queries and answers. With sufficiently large and diversified training data, the LSTM outputs for any new paired query and answer can be taken as almost equivalent, and the semantic space is the vector space of all such one-dimensional LSTM output vectors. Finally, since vectors in the semantic space can be generated by the LSTM network from two-dimensional arrays, it is possible to restore the original arrays given the outputs using another LSTM network. A decoder LSTM with 128 hidden units is thus created to approximate the true answer embedding given its vector in the semantic space. The decoder LSTM returns a vector reshaped to 100 at each step, which can be translated to a word in the dictionary by minimizing the distance step-wise. The sentences formed by those chosen words in sequence are the answers generated for the given queries. To improve the syntax of generated answers, the decoder can be pretrained with random human language sentences, given a well-trained answer encoder from the previous stage.
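The final translation step, in which each decoder output vector is mapped to the dictionary word at minimum distance, can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and Euclidean distance is an assumption, since the abstract does not name the distance used.

```python
import numpy as np

def decode_nearest_words(step_vectors, vocab, embeddings):
    """Map each decoder output vector to the closest word in the dictionary.

    step_vectors: array of shape (n_steps, dim), one decoder output per step
    vocab:        list of n_words strings
    embeddings:   array of shape (n_words, dim), row i is the vector of vocab[i]
    """
    # distances[t, w] = Euclidean distance between step t and word w
    distances = np.linalg.norm(
        step_vectors[:, None, :] - embeddings[None, :, :], axis=2
    )
    # Pick the closest word independently at every step.
    return [vocab[i] for i in distances.argmin(axis=1)]
```

With the settings given in the abstract, `step_vectors` would have shape (15, 100), and the returned words read in sequence form the generated answer.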
Although motivated by answer ranking techniques and the Sequence-to-Sequence (Seq2Seq) model (Sutskever et al., 2014), our method differs from each in significant aspects. Compared to Information Retrieval techniques, more flexible and interactive answering systems are enabled. At the same time, we improve on the Seq2Seq model by adding one additional optimizer at the end of the encoding stage. Thus, semantic representations of input sequences are learned on top of purely lexical information. The proposed model can accommodate advanced tasks of conversational agents by replacing the LSTM network with more robust alternatives. For example, multi-turn question answering can be delivered by incorporating an attention network as the context. This approach can be generalized to fit any paired sequences of text, allowing potential applications in machine translation, sentence paraphrasing and dialog systems. We conducted experiments on the Quora Question Pairs and Jokes (question answering) data sets, two large-scale public data sets of paired texts. The number of records used for training was more than 149 thousand. Experimental results demonstrate that the semantic matching component can select the top-10 relevant answers out of more than 24,877 candidates with an accuracy of 80.23%, and that the decoder component can generate sensible responses for a given question.
370Proper Name Pronunciation in Some African LanguagesIbukunola Modupe, The Vaal University of Technology; Sigrid Ewert, University of the Witwatersrand; Mpho Raborife, University of The Witwatersrand.The pronunciation of proper names in African languages such as Yoruba (in Nigeria) and Sepedi (in South Africa) continues to be a major challenge in speech synthesis. Change in tone is a fundamental means of communication and cultural transmission that exists among families. This is compounded by the effort needed to resolve ambiguity in order to automate speech generation for such languages. Generally, the individual syllable or subject in any of the African languages
can be contrastive, lexically and grammatically categorized as a verb or noun tense. The fundamental questions in this study are: what needs to be done to understand the pronunciation of proper names in some African languages, why is accurate pronunciation of proper names so hard, and how do we know that the pronunciation of proper names is acceptable?
This study proposes to use a joint-sequence model (JSM) with embedded stress assignment based on language identification (LID) as an unsupervised model to find an automatic and efficient way to improve the pronunciation model for proper names, and to compare its results with those of a long short-term memory (LSTM) recurrent neural network as well as with other approaches and previous studies. We believe that this research will offer evidence that helps in understanding the orthography of African language names, which can be used to improve pronunciation dynamically.
The effectiveness of the model will be demonstrated qualitatively and quantitatively by developing corpora as part of this work for African multilingual name pronunciation (AfriMultipron corpus) that contains names in two languages (Yoruba and Sepedi) produced by literate speakers of the particular languages.
375Towards Human-Like Holistic Machine Perception of Affective Social BehaviourYue, Zhang, Imperial College London, U.K.; Felix Weninger, Nuance Communications, Germany; Yifan, Liu, Imperial College London, U.K.; Björn, Schuller, Imperial College London, U.K.The fascination of Artificial Intelligence (AI) is more than a practical one; we express a deeper affiliation with affective AI that is able to communicate with us using emotional and social signals via non-verbal channels. Within the realm of Affective Computing and Social Signal Processing, machine learning research has aimed at more natural human-machine communication by endowing machines with human-like perceptual and analytical abilities. For the holistic analysis of affective states, the first obstacle to overcome is the general scarcity of multi-label databases. Compounding the problem of label scarcity, one major shortcoming in current research is that emotion representations are considered in isolation, even though strong interdependencies between various categorical, dimensional, and appraisal-based emotion concepts exist.
In this work, we advocate the usage of multi-task shared-hidden-layer deep neural networks (MT-SHL-DNN) for holistic affect sensing. The efficacy of the novel approach is demonstrated on the example of acoustic emotion recognition. To this end, the feature transformations in the hidden layers are made common for all emotion description schemes, while the softmax layers functioning as log-linear classifiers are separately assigned to each recognition task. In this way, we achieve large-scale data aggregation without any information loss owing to label mapping or discretisation. On nine frequently used emotional speech databases, we demonstrate that the proposed method outperforms the single-task DNNs that are trained with only one emotion scheme.
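The shared-hidden-layer arrangement described above (hidden feature transformations common to all emotion description schemes, with a separate softmax layer per recognition task) can be sketched as a forward pass. This is a minimal sketch under stated assumptions: the class name, layer sizes, `tanh` activation, weight initialization and task names are hypothetical, and training is omitted.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class SharedHiddenLayerNet:
    """Hidden layers shared by all tasks, one softmax output layer per task."""

    def __init__(self, n_features, hidden_sizes, task_classes, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [n_features] + list(hidden_sizes)
        # Shared feature transformations (weights and biases).
        self.hidden = [
            (rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])
        ]
        # One task-specific output layer per emotion description scheme.
        self.heads = {
            task: (rng.normal(0.0, 0.1, (sizes[-1], k)), np.zeros(k))
            for task, k in task_classes.items()
        }

    def forward(self, x, task):
        h = x
        for w, b in self.hidden:
            h = np.tanh(h @ w + b)  # activation choice is an assumption
        w, b = self.heads[task]
        return softmax(h @ w + b)
```

Because every database only needs a head for its own label scheme, data from differently labelled corpora can pass through the same hidden layers without mapping labels onto a common set.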
376Robust Parsing for Ungrammatical SentencesHoma B. Hashemi, University of PittsburghNatural Language Processing (NLP) is a research area that specializes in studying computational approaches to human language. However, not all natural language sentences are grammatically correct. Sentences that are ungrammatical, awkward, or too casual/colloquial tend to appear in a variety of NLP applications, from product reviews and social media analysis to intelligent language tutors or multilingual processing. In this research, we focus on statistical parsing, because it is an essential component of many NLP applications. We investigate in what ways the performance of statistical parsers degrades when dealing with ungrammatical sentences. We also hypothesize that breaking up parse trees at problematic parts prevents NLP applications from degrading due to incorrect syntactic analysis.

A parser is robust if it can overlook problems such as grammar mistakes and produce a parse tree that closely resembles the correct analysis for the intended sentence. We develop a robustness evaluation metric and conduct a series of experiments to compare the performance of state-of-the-art parsers on ungrammatical sentences.
The evaluation results show that ungrammatical sentences present challenges for statistical parsers, because the well-formed syntactic trees they produce may not be appropriate for ungrammatical sentences.

We also define a new framework for reviewing the parses of ungrammatical sentences and extracting the coherent parts whose syntactic analyses make sense. We call this task parse tree fragmentation. We propose a training methodology for fragmenting parse trees without using a task-specific annotated corpus. We also propose two automatic fragmentation strategies that jointly parse the ungrammatical sentence and prune the incorrect arcs: a parser retrained on a parallel corpus of ungrammatical sentences with their corrections, and a sequence-to-sequence deep neural network method.

We evaluate parse tree fragmentation methods on two extrinsic tasks, fluency judgment and semantic role labeling, in two domains of ungrammatical sentences: English as a Second Language (ESL) and machine translation (MT). Experimental results show that the proposed strategies are promising for detecting incorrect syntactic dependencies as well as incorrect semantic dependencies; they also suggest that the overall framework is a promising way to handle syntactically unusual sentences.
391Abandoned Object Detection using Scale Invariant Local Ternary Operator and pixel-based Finite State MachineChinmayee Athalye, College of Engineering Pune; Devadeep Shyam, Nanyang Technological UniversityDue to the increasing number of surveillance devices, and the ensuing data flood, it has become impossible to manually process all the feeds from surveillance cameras. In this work, we have developed a scalable, novel, and robust framework for automatic detection of abandoned, stationary objects that can pose a security threat, from surveillance videos.

We use a hybrid background modeling method sViBe – a combination of the Visual Background Extractor (ViBe) and the Scale Invariant Local Ternary Operator (SILTP) – to model the background. The main advantage of using this method is that the SILTP makes it robust to illumination changes occurring in the scene. The use of ViBe gives us control over the learning rate used in the background modeling. We use two different models corresponding to two different learning rates – a long-term model with a slow learning rate and a short-term model with a higher learning rate. A foreground object is absorbed into the background faster in the short-term model whereas it’s detected as foreground for a longer time in the long-term model. We take into consideration this difference between the two models and the temporal transition of the pixel states to detect stationary foreground objects using the Finite State Machine (FSM). The novel design of the FSM makes this framework robust to temporary occlusions.
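The idea of combining the long-term and short-term foreground masks in a per-pixel state machine can be illustrated as below. The abstract does not specify the states or transitions of the authors' FSM, so everything here is a hypothetical simplification: the state names, the `static_after` threshold and the rule that holds the state during an occlusion are assumptions made for the sketch.

```python
from enum import Enum

class PixelState(Enum):
    BACKGROUND = 0
    MOVING = 1
    CANDIDATE = 2   # foreground only in the long-term model
    STATIC = 3      # candidate that has persisted long enough

def step(state, counter, long_fg, short_fg, static_after=3):
    """Advance one pixel by one frame; returns (new_state, new_counter).

    long_fg / short_fg: whether the long-term / short-term background model
    currently labels the pixel as foreground.
    """
    if not long_fg:
        # The long-term model sees background: reset the pixel.
        return PixelState.BACKGROUND, 0
    if short_fg:
        # Foreground in both models means something is moving here.
        # If a stationary object was already being tracked, treat this as a
        # temporary occlusion and keep the state and counter unchanged.
        if state in (PixelState.CANDIDATE, PixelState.STATIC):
            return state, counter
        return PixelState.MOVING, 0
    # Foreground in the long-term model only: the short-term model has
    # absorbed the object, so it has stopped moving.
    counter += 1
    if counter >= static_after:
        return PixelState.STATIC, counter
    return PixelState.CANDIDATE, counter

def run(observations, static_after=3):
    """Apply `step` to a sequence of (long_fg, short_fg) pairs for one pixel."""
    state, counter = PixelState.BACKGROUND, 0
    history = []
    for long_fg, short_fg in observations:
        state, counter = step(state, counter, long_fg, short_fg, static_after)
        history.append(state)
    return history
```

In a full system the same transition would run for every pixel of each frame, and connected regions of `STATIC` pixels would be passed on to the abandonment checks.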

We employ certain spatio-temporal conditions on the detected stationary objects to check for abandonment. These checks make sure there are no false positives arising from temporarily unattended luggage. To further remove the false positives, we use single shot multibox detector (SSD) to classify the detected object as person or bag. This framework was tested on all the standard, benchmark datasets for abandoned object detection. It matches the performance of some and outperforms most of the existing state-of-the-art methods for abandoned object detection.

(Submitted abstract also contains pictures)
403Deep Semi-Supervised Learning with Virtual Adversarial Ladder NetworksSaki Shinoda, Prediction Machines; Daniel Worrall, University College London; Gabriel Brostow, University College LondonSupervised learning requires a labeled dataset, which can be expensive and time-consuming to create. Semi-supervised learning (SSL) partially circumvents this high cost by augmenting a small labelled dataset with a large and relatively cheap unlabelled dataset, drawn from the same distribution. We propose a new unifying interpretation of two apparently different deep-learning based SSL approaches (ladder networks [1] and virtual adversarial training [3]), allowing us to fuse the two and achieve near-supervised accuracy on the MNIST dataset using just 5 labels per class.
Our novel interpretation is that these two methods perform distributional smoothing over their respective latent spaces to share statistical information between labelled and unlabelled representations. For both supervised and unsupervised tasks, the ladder network uses a single autoencoder-like architecture with added skip connections from encoder to decoder. For labelled examples, the encoder is used as a feed-forward classifier, and for the unsupervised task, the full architecture is used as a denoising autoencoder, with extra reconstruction costs on intermediate representations. Denoising is applied to encoder activations with additive Gaussian noise, which we interpret as applying isotropic smoothing to the hierarchy of latent spaces modelled by the ladder network. Virtual adversarial training (VAT) augments the training set with virtual adversarial perturbations [3], which are chosen to maximize the KL-divergence between the output distribution conditioned on the perturbed and unperturbed examples. We interpret VAT as anisotropic smoothing over the output probability distribution.
Fusing these two approaches, we propose a new class of deep models for semi-supervised learning that applies anisotropic smoothing to hierarchical latent spaces in the direction of greatest curvature of the unsupervised loss function. We investigate applying a virtual adversarial training cost in addition to classification and denoising costs; and alternatively injecting virtual adversarial noise into the encoder path. We train our models with 5, 10, or 100 labelled examples per class from the MNIST dataset. We evaluate performance on both the standard test set and adversarial examples generated using the fast gradient method [4].
We find that our models achieve state-of-the-art accuracy with high stability in the 5- or 10- labels per class setting on both normal and adversarial examples based on MNIST. Our best model, ladder with layer-wise virtual adversarial noise (LVAN-LW), achieves 1.42% +/- 0.12 average error rate on the MNIST test set, in comparison with 1.62% +/- 0.65 reported for the standard ladder network [2]. On adversarial examples, LVAN-LW trained with 5 examples per class achieves average error rate 2.41% +/- 0.30 compared to 68.60% +/- 6.51 for the ladder network and 9.91% +/- 7.54 for VAT.

[1] Valpola, H., 2014. From neural PCA to deep unsupervised learning. arXiv preprint arXiv:1411.7783.
[2] Rasmus, A., Berglund, M., Honkala, M., Valpola, H. and Raiko, T., 2015. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems (pp. 3546-3554).
[3] Miyato, T., Maeda, S.I., Koyama, M. and Ishii, S., 2017. Virtual Adversarial Training: a Regularization Method for Supervised and Semi-supervised Learning. arXiv preprint arXiv:1704.03976.
[4] Goodfellow, I.J., Shlens, J. and Szegedy, C., 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
434Personalized Workload Assignment in Software Development: A Two-level Hybrid ApproachNan Wang, Fordham UniversityPeople analytics is gaining popularity because it is expected to eliminate biases that exist in all sorts of people-related issues including recruitment and performance evaluation, promotion and compensation, as well as talent assessment and development. We propose applying the technique of recommender systems to optimize talent usage by personalizing workload assignment. In the paper, we demonstrate the feasibility of this approach in a software development environment. We introduce a two-level hybrid (2LH) approach to build the recommender system. We empirically validate its predictive accuracy and recommendation effectiveness even on sparse data. Major merits of this approach include scalability, flexibility and sustainability.
2LH has two base hybrids: an organizational network analysis (ONA) based hybrid and a graph projection based hybrid. We demonstrate ONA as an effective way of profiling users and a solution to sparsity. We discuss common performance metrics and introduce a customized one: the frequency of the system failing to recommend at least 10 items (Rl<10).
Lastly, we discuss limitations of 2LH and its future extensions as realistic applications to drive business value.
445
Generalizing Robust Covariate Shift Prediction
Anqi Liu
The covariate shift learning setting relaxes the widely-employed independent and identically distributed (IID) assumption by allowing different training and testing input distributions. Unfortunately, common methods for addressing covariate shift by trying to remove the bias between training and testing distributions using importance weighting often provide poor performance guarantees in theory and unreliable predictions with high variance in practice. Recently developed methods that construct a predictor that is inherently robust to the difficulties of learning under covariate shift are often too conservative when faced with high-dimensional learning tasks. We introduce a generalization of robust covariate shift classification that allows the influence of covariate shift to be limited to different feature-based views of the relationship between input variables and example labels. We demonstrate the benefits of this approach on classification under covariate shift tasks.
446
Canonical Autocorrelation Embeddings for Comatose Patient Characterization
Maria De-Arteaga*, Carnegie Mellon University; Peter Huggins, Carnegie Mellon University; Jonathan Elmer, University of Pittsburgh Medical Center; Gilles Clermont, University of Pittsburgh Medical Center; Artur Dubrawski, Carnegie Mellon University
In this work we present Canonical Autocorrelation Embeddings, a method for embedding sets of data points onto a space in which they are characterized in terms of their latent multivariate correlation structures, and where a distance metric enables the comparison of such structures.

This methodology is particularly fitting to tasks where each individual or object of study has a batch of data points associated to it, e.g., patients for whom several vital signs or multiple channels of brain activity are recorded over time. Canonical Autocorrelation Analysis is used for finding multiple-to-multiple linear correlations within the batch of data associated to each individual, yielding a new representation of the data in a space where we define a metric to measure distance between the discovered canonical autocorrelation structures. Within this new feature representation, traditional machine learning algorithms that rely on distance metrics, such as clustering and k-nearest neighbors, can be used, with the caveat that unlike traditional settings where each individual is represented by a single data point, in this case each individual is represented by a set of CAA objects in the newly defined space.

We apply the resulting methodology for characterizing brain activity of comatose survivors of cardiac arrest. Clinicians routinely face the ethically and emotionally charged decision of whether to continue life support for such patients or not. Both scenarios have potentially grave implications on patients and their close ones, so regardless of whether they believe they have enough information, clinicians are often forced to make a prediction. Currently, prognosis relies in great part on observed patterns of EEG activity. Correlations are known to play an important role, as strong correlations between channels are indicative of a poor neurological state. However, doctors have identified that correlations are relevant beyond what is evident to the human eye, with theta-coma and alpha-coma being examples of cases in which the EEG recording appears to show healthy variability between channels but, after some preprocessing, it becomes evident that the brain activity is dominated by simple, strong, cyclic patterns.

Our results show that we can identify with high confidence a substantial number of patients who are likely to have a good neurological outcome. Providing this information to support clinical decisions could motivate the continuation of life-sustaining therapies for patients whose data suggest it to be the right choice.
447
Computing Approximate Frequent Items in a Distributed Environment
Ankush Mandal, Georgia Institute of Technology; He Jiang, Rice University; Anshumali Shrivastava, Rice University; Vivek Sarkar, Georgia Institute of Technology
Counting and identifying frequently occurring items is one of the most important and intuitive ways to gain insight into large-scale data. Modern large datasets occupy significant space, and the associated frequency histograms are prohibitively heavy to store and communicate across distributed nodes. Probabilistic data structures are popular for approximating heavy hitters to further reduce the memory footprint and computational complexity, with statistical guarantees on the error. However, existing randomized algorithms are not capable of exploiting both node-level and distributed parallelism efficiently. As a consequence, despite being theoretically superior, randomized algorithms have not achieved impressive performance in distributed counting benchmarks. Our proposed algorithm and implementation change this.

Bloom filters, Count-Min sketch (CM sketch), and Lossy Counting are commonly used probabilistic data structures. CM sketch is the most popular algorithm because it can exploit multi-core parallelism on each node, which Lossy Counting cannot. CM sketch creates a tiny sketch data structure represented by a two-dimensional array of w columns and d rows, and has d hash functions. To update the sketch with each incoming item X of size n, for each row j, apply hash function h_j to obtain a column index k=h_j(X), and increment the sketch value in row j, column k by n. For a query asking the count of an item Y, the sketch returns the minimum, over all rows j, of the value stored in row j, column h_j(Y). The error guarantee relies on the assumption that heavy hitters are rare in real-world data, and the algorithm only makes mistakes when two or more heavy hitters are hashed into the same bucket in all rows.
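
The update and query rules above can be sketched in a few lines of Python. This is only an illustrative, single-node sketch, not the authors' MPI implementation: the class name, the choice of a salted BLAKE2 hash to stand in for the d hash functions h_j, and the toy stream are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min sketch: d rows (depth), w columns (width)."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, item):
        # Assumption: one hash function per row, obtained by salting the
        # item with the row index before hashing.
        data = f"{row}:{item}".encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, item, count=1):
        # Increment the cell in row j, column h_j(item), for every row j.
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count

    def query(self, item):
        # The estimate is the minimum over rows; it never undercounts.
        return min(
            self.table[row][self._column(row, item)]
            for row in range(self.depth)
        )


sketch = CountMinSketch(width=64, depth=4)
stream = ["a"] * 50 + ["b"] * 20 + ["c"] * 5
for token in stream:
    sketch.update(token)

estimate_a = sketch.query("a")
```

Because every cell only ever accumulates counts, the estimate for an item is at least its true count; collisions can only inflate it.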

CM sketch, however, does not directly enable extracting the top-k most frequent items, because the identities of the hashed items are not stored and hash functions are typically not reversible. Keeping track of the identity of the streamed data defeats the purpose of a compact summary data structure and is not practically feasible for large-scale datasets.

We propose a novel algorithm and its highly optimized implementation over MPI. Our algorithm exploits both local and distributed parallelism and fixes the issues with the existing widely applied algorithms. Furthermore, our proposal only requires a tiny amount of memory and communication. Our experiments indicate that the proposed MPI implementation can be significantly faster than state-of-the-art alternatives for the task of finding frequent items in distributed settings.
450
SimplerVoice: A Key Message & Visual Description Generator System for Illiteracy
Minh N.B. Nguyen, University of Southern California; Samuel Thomas, IBM Watson; Anne E. Gattiker, IBM Research; Sujatha Kashyap, IBM Research; Kush R. Varshney, IBM Research AI
757 million adults worldwide could not read this sentence, nor comprehend its complex construction even if spoken aloud. In this work, we introduce SimplerVoice: a system that generates key messages and visual descriptions to support low-literate adults in navigating the information-dense world with confidence, on their own.

A previous study in this field (Schroff et al., 2011) proposed an approach to automatically harvest a large number of images for specified object classes: it downloads all website contents returned by a Web search query, removes irrelevant components, and re-ranks the remainder. However, that study did not address action-object interaction classes, which might be needed to describe an object. Another line of work links the text sequence to a database of pictographs: Vandeghinste et al. (2015) introduced a text-to-pictograph translation system that is used in an online platform for Augmentative and Alternative Communication. The text-to-pictograph system was built and evaluated on email text messages. Recently, studies have proposed using deep generative adversarial networks to perform text-to-image synthesis (Reed et al., 2016; Zhang et al., 2016); nevertheless, these techniques might still be limited in scalability or image resolution.

In this work, we propose to use cognitive technology along with natural language processing and other information retrieval techniques in SimplerVoice. SimplerVoice can automatically generate sensible sentences describing an unknown object, extract the semantic meaning of the object's usage in the form of a query string, and then represent the string as multiple types of visual guidance (pictures, pictographs, etc.). The system consists of 4 main components: input retrieval, object2text, text2visual, and output display. We introduce an ontology-based method to perform word-sense disambiguation when generating text (object2text), and an approach to retrieve optimal visual components (text2visual) that combines photorealistic images and pictographs for low-literate users by utilizing a web search engine and an existing ontology based on WordNet.

To evaluate the system, we demonstrate SimplerVoice in a case study of generating grocery product manuals through a mobile application. The application was provided to low-literate end users and received positive feedback. Our results show that SimplerVoice can provide illiterate users with simple yet informative components to help them understand how to use the grocery products, and that the system may potentially provide benefits in other real-world use cases.
456
Imagined Speech Classification using EEG signals
Rishika Agarwal, University of Illinois at Urbana Champaign; Piyush Rai, Indian Institute of Technology Kanpur; Lakshmidhar Behera, Indian Institute of Technology Kanpur
One of the primary aims of a Brain Computer Interface is to provide basic functionality to physically disabled patients, for example stroke/paralysis patients. Mental Speller is one such application, which enables patients to type a message simply by concentrating their thoughts as required, without the use of any physical movements. In principle, if we can identify the syllables/phonemes a person is imagining, we can string together the word composed of those phonemes, and build a mental speller.

Imagined Speech is a relatively new area of research in Brain Computer Interface. An efficient implementation of it can be used to build an easy-to-use Mental Speller. In this project, we investigated the usability of EEG for this purpose. We recorded EEG signals from 16 channels (markedly fewer than the 128 channels that most previous works have used) to capture the brain activity of a person, and used them to identify the phoneme the participant is imagining. It was a binary classification problem, the 2 classes being 'ba' and 'ku'. The signals were pre-processed to remove artifacts and noise, and two different paradigms of treating multi-modal data were investigated. Finally, standard Machine Learning classifiers were used to classify the phonemes.

Pre-processing involved performing an Independent Component Analysis on the EEG signals to calculate the source vectors from the sensor vectors. Thereafter, the Hurst exponent was used to identify which components of the source vector corresponded to noise/artifacts and which corresponded to relevant neural activity. On the processed vector thus obtained, two broad approaches were investigated: matricization of the feature tensor, and tensor factorization. EEG data is multi-modal, i.e., it has multiple dimensions or modes of data; in this case, there were three dimensions: the number of samples or recordings (N), the number of time frames in one sample (T), and the number of channels (C). Matricization flattens the multi-modal data tensor into a matrix whose rows represent different samples and whose columns represent features. However, flattening the data tensor leads to a loss of multi-modal structure, which results in poor classification accuracies using standard classifiers. Thus, tensor factorization techniques like Canonical Polyadic Decomposition (CPD) are needed. We performed CPD on our multi-modal tensor, and heuristically chose R, the number of rank-one tensors which, along with an error tensor, sum up to the data tensor. The rank-one tensors are the outer products of the columns of the component matrices. The component matrices can be considered as embeddings of the different modes, and hence can be used as feature vector matrices for standard classifiers. Significantly higher classification accuracy (using standard classifiers like Perceptron and SVM) was obtained from tensor factorization than from matricization, which highlighted the importance of maintaining the inherent multi-modal structure of EEG data.
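
To make the two representations concrete, the following NumPy sketch builds a small N x T x C tensor from a rank-R CP model and then flattens it. It is not the authors' pipeline: the sizes, the random factor matrices A, B and Cm, and the absence of an error tensor are assumptions chosen only to illustrate the shapes involved. In practice the factor matrices would be fitted to recorded data by a CPD solver rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, C, R = 6, 5, 4, 2  # samples, time frames, channels, assumed CP rank

# Hypothetical component (factor) matrices, one per mode.
A = rng.normal(size=(N, R))   # sample-mode embedding
B = rng.normal(size=(T, R))   # time-mode embedding
Cm = rng.normal(size=(C, R))  # channel-mode embedding

# CP model: the tensor is a sum of R rank-one tensors, each the outer
# product of one column from every factor matrix.
X = np.einsum("nr,tr,cr->ntc", A, B, Cm)

# The same tensor written out term by term.
X_explicit = sum(
    np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), Cm[:, r])
    for r in range(R)
)

# Matricization: one row per sample, T * C feature columns.
X_flat = X.reshape(N, T * C)

# Feature matrices a standard classifier could consume.
features_from_flattening = X_flat  # shape (N, T * C)
features_from_cpd = A              # shape (N, R)
```

The flattened matrix has T * C columns per sample, while the sample-mode factor matrix has only R, which is one way to see how the factorization yields a compact per-sample feature vector.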
460
A Deep Learning Approach for Alzheimer's Disease Detection and Classification from Brain MRI Data
Jyoti Islam*, Georgia State University; Yanqing Zhang, Georgia State University
This paper presents a novel deep learning model for multi-class Alzheimer’s Disease detection and classification using brain MRI data. Alzheimer’s Disease (AD) is a neurological brain disorder which destroys brain cells, causing people to lose their memory, mental functions and ability to continue daily activities. Although AD is not curable, early detection and classification of AD are critical for proper treatment and preventing brain tissue damage. Machine learning techniques can vastly improve the process for accurate diagnosis of AD. Recently, deep learning techniques have achieved major success in medical image analysis, but relatively little investigation has been done into applying deep learning techniques to AD detection and classification. We design a very deep Convolutional Neural Network (CNN) and demonstrate its performance on the Open Access Series of Imaging Studies (OASIS) database [1].

There are three major stages in Alzheimer’s Disease: very mild, mild and moderate. Fig. 1 shows some brain MRI images presenting different AD stages. Extensive knowledge and experience are required to distinguish the AD MRI data from the aged normal MRI data. Researchers have developed several computer-aided diagnostic systems for AD detection. Most of them use handcrafted feature generation and extraction from the MRI data [2], [3]. The features are then fed into machine learning models such as Support Vector Machines, logistic regression models, etc. These multi-step architectures are complex, time consuming and highly dependent on human experts. In contrast, our proposed deep CNN does not need hand-crafted features. In addition, a large dataset is crucial for developing a robust deep neural network, but neuroimaging datasets are typically small, so it is important to develop models that can learn useful features from a small dataset. We have used transfer learning to overcome this issue and pre-trained the model on the ImageNet database.

Our model is inspired by the Inception-V4 network [4]. To fit the MRI data, we have designed the input size of our network as 299×299×1 and modified the Inception B and C modules. The softmax layer has four output classes: nondemented, very mild, mild and moderate AD. The classifier takes an MRI image as input and extracts layer-wise feature representations from the first stem layer to the last drop-out layer. Based on this feature representation, the input MRI image is classified into one of the four output classes.

We have provided a one-step analysis for AD detection and classification using brain MRI data. The current accuracy of our method is 73.75%. The proposed model is much faster and takes less than 1 hour to train and test on the OASIS dataset. This performance is superior to all previous traditional methods. Currently, we are experimenting with different hidden layers and convolutional filters to find a more efficient model. In the future, we hope to work with other MRI AD datasets such as ADNI and achieve similar or better performance. Note: this work was accepted for publication in BI’17 as a full research paper.

[1] D. S. Marcus, T. H. Wang, J. Parker, J. G. Csernansky, J. C. Morris, and R. L. Buckner, “Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults,” Journal of cognitive neuroscience, vol. 19, no. 9, pp. 1498–1507, 2007.
[2] B. Magnin, L. Mesrob, S. Kinkingnéhun, M. Pélégrini-Issac, O. Colliot, M. Sarazin, B. Dubois, S. Lehéricy, and H. Benali, “Support vector machine-based classification of alzheimers disease from whole-brain anatomical mri,” Neuroradiology, vol. 51, no. 2, pp. 73–83, 2009.
[3] J. H. Morra, Z. Tu, L. G. Apostolova, A. E. Green, A. W. Toga, and P. M. Thompson, “Comparison of adaboost and support vector machines for detecting alzheimers disease through automated hippocampal segmentation,” IEEE transactions on medical imaging, vol. 29, no. 1, p. 30, 2010.
[4] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” arXiv preprint arXiv:1602.07261, 2016.
466
Medical Image Analysis using a Novel Learning Algorithm based on Plant Intelligence
Deblina Bhattacharjee, Samsung POSTECH
Manual analysis and classification of blood smear images is a) time consuming, b) prone to error due to different morphological features of the cell, c) less precise due to inter- and intra-observer variability, and d) sensitive to labor-intensive routine procedures. Recently, computational intelligence techniques have been used to solve this problem, including gradient-based optimization methods using first-order information (such as stochastic gradient descent with support vector machines), search-based optimization (such as genetic algorithms, simulated annealing, and electromagnetic optimization with a fuzzy cellular neural network (FCNN)), and other hybridized methods. However, these techniques failed to report precise solutions in images having deformed, complex and hidden cell features, especially under noisy conditions, making it impossible to process normal resolution images. Interestingly, the FCNN gave distorted contours of detection with an exponential increase in the number of iterations.

Thus, we propose a novel learning algorithm based on the intelligent decision making of biological plants to solve the object detection problem in blood smear images. We have modelled the intelligence of biological plants (the crux of our research), and have applied it to solve the above-stated image processing problems. The motivation lies in the fact that plants have 13 senses as opposed to 5 human senses and can use them effectively to process information such that it guarantees their survival in dynamic environments, despite lacking a neural structure and a brain. The algorithm makes use of the plant growth simulation algorithm (PGSA) and a two-dimensional learning network that uses reinforcement as a feedback signal.

PGSA is used as the optimization technique in the learning algorithm as it gives a higher rate of accuracy and faster global optimization while giving stable solutions with a well-balanced exploration-to-exploitation ratio. The PGSA optimizes the parameters of the learning algorithm, which has a weighted action network and an evaluation network. The action network holds the set of all initial values for the weights of what action needs to be taken, and the evaluation network forms a mapping from the inputs to the fitness value of the current solution. The learning algorithm is used as a sub-pixel detector for solving the object detection problem in blood smear images. An example figure of the process is shown below, where (a) the given smear image (b) is segmented to remove uninteresting features using histogram thresholding, (c) a Canny filter is used to obtain the edge map, (d) the learning algorithm initializes the candidate solutions and starts matching pixels of the current candidate with the edge map of the image blood cells (red pixels show the region of maximum overlap), and finally (e) PGSA optimizes the solution and gives the global best as the output.

We analyze performance metrics like precision, sensitivity and noise resistance for the proposed approach and all other state-of-the-art approaches in computer vision to date for solving such problems, achieving an ROC of 0.9828 and precision of 96.12% under noisy conditions while detecting hidden cells too.
481
Learning Human-Driver Behaviors using Inverse Reinforcement Learning to Enable Autonomous Highway-Merging in Self-Driving Cars
Anuja Nagare, University of Georgia; Prashant Doshi, University of Georgia
The first working models of autonomous cars were built in the 1980s. Since then, companies and research organizations including Mercedes-Benz, Google, Tesla, Uber, Toyota and many more have been breaking a lot of ground in this field to build prototype autonomous cars that navigate roads safely. To this end, an autonomous car must account for the uncertainty both in the environment and in the driving patterns of other cars (manned or unmanned).
Behaviors of other drivers and their preferences are typically unknown a priori. Therefore, learning them is the key to predicting what these drivers may do in the future. A useful machine learning technique known as Inverse Reinforcement Learning (IRL) enables us to recover the other driver’s preference function based on his/her observed driving patterns, either online or in a simulation. We can then use the learned preference function to complete the other driver’s model specification and use this model to predict how he/she might drive in the future.
IRL assumes that the world is modelled as a Markov Decision Process (MDP). An MDP is a tuple <States, Actions, Transitions, Discount Factor, Rewards>. It models decision making in a stochastic environment, and the core problem is to find an optimal policy (a mapping of States to Actions) for the decision maker (i.e., the self-driving car which is trying to merge onto the highway from an acceleration ramp). Various policy optimization techniques, such as policy iteration and policy gradient, can be used to determine the optimal policy.
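
As a concrete illustration of the MDP tuple and of policy iteration, here is a minimal NumPy sketch. The 3-state, 2-action MDP below is a made-up toy, not the merging model from this abstract; the transition array P, reward array R and discount gamma are all assumed values. The rewards are given here, whereas IRL would have to infer them from observed trajectories.

```python
import numpy as np

# Hypothetical toy MDP. P[a, s, t] is the probability of moving from
# state s to state t under action a; R[s, a] is the immediate reward.
P = np.array([
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],  # action 0 (slow)
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],  # action 1 (fast)
])
R = np.array([[0.0, -0.1], [0.0, -0.1], [1.0, 1.0]])
gamma = 0.9  # discount factor


def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation and greedy policy improvement."""
    n_actions, n_states, _ = P.shape
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi.
        P_pi = P[policy, states]  # row s is P[policy[s], s, :]
        r_pi = R[states, policy]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to v.
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy


policy, values = policy_iteration(P, R, gamma)
```

The returned `policy` array is the mapping from states to actions described above, and `values` holds the expected discounted return from each state under that policy.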
The research focus is to present a novel method for solving the highway lane merging problem. In this problem, the self-driving car which is trying to merge onto the highway from an acceleration ramp attempts to learn the behavior of cars driving on the outermost highway lane near the merging area in order to merge safely onto the highway.
We have designed the state space, action space and transition function for the underlying MDP model. We are working on improving the existing baseline by devising a new IRL algorithm based on Bayesian IRL (Ramachandran & Amir, 2007). Our evaluation involves real-world traffic data from a congested freeway in California (NGSim dataset). Driving patterns are extracted from the dataset for each car; every instance of a driving pattern has a total of 18 attributes (e.g., x-y coordinates, velocity, acceleration, etc.). The driving pattern of a car driving on the outermost highway lane near the merging area is called the “expert’s trajectory”. An expert’s trajectory can be thought of as a series of state-action pairs in a Markov Decision Process.
The scope of our work is to enable self-driving cars to safely merge onto a highway consisting of cars, by explicitly learning expert’s reward structure for the underlying MDP to construct a policy (𝜋: 𝑆→𝐴).
485
In Search of Long-Period Comets with Deep Learning Tools
Susana Zoghbi, SETI; Marcelo De Cicco, SETI; Antonio Ordonez, SETI; Andres Plata Stapper, SETI; Jack Collison, SETI; Peter Gural, SETI; Siddha Ganju, SETI; Jose Luis Galache, SETI; Peter Jenniskens, SETI
The aim of this paper is to provide Deep Learning tools to aid the search for debris of long-period comets. Due to their large size and fast traveling speeds of up to 70 km/s, and despite the rarity of their impacting Earth, long-period comets are recognized as potentially the most devastating impact threat to our planet. Evidence indicates that the impact of a comet or asteroid, having a diameter of about 10 km, was responsible for the mass extinction of most species of dinosaurs 65.5 million years ago. However, any new comet on an impact trajectory with Earth would likely only be discovered 6-12 months before impact, when it becomes visible as the Sun's heat and wind start sublimating its icy surface and ejecting rocky debris.
The orbits of such debris can be used by astronomers to guide the search for comets while they are still far out, providing us years of extra warning time in case of a collision path. Most suitable for this are the rare aperiodic meteor showers from debris by a comet in a previous visit to the inner Solar System. Detecting those showers requires a continuous and global search.
The Cameras for Allsky Meteor Surveillance (CAMS) is a network of low-light video cameras, established by the SETI Institute in different locations across the globe, that monitors the sky to detect meteors. Until now, processing the images collected by CAMS has required time-consuming human input. On an average night, an astronomer receives around 500 detections per camera, consisting of images and light intensity curves (a sequence of measurements of how light intensity changes as detected objects move in the sky). With 16 cameras per site, this amounts to a total of 8,000 observations.
Figure 1 presents examples of images captured by CAMS. Most of these turn out to be false detections, such as planes, birds, clouds, etc. Sorting through these every night is not scalable. To alleviate this, we automate this process using Deep Learning. To the best of our knowledge, this is the first time that deep learning techniques have been applied to this endeavour.
Specifically, we trained a Convolutional Neural Network (CNN) that discerns images of meteors vs. other objects in the sky. We used five convolutional layers followed by two fully connected layers and a binary softmax classifier. Dropout and max-pooling layers were used. We also performed standard data augmentation techniques on instances of the positive class, such as rotation and flipping. Our CNN achieves precision and recall scores of 88.3% and 90.3%, respectively. In addition, we trained a Long Short-Term Memory (LSTM) network that encodes the light curve tracklets into a latent space, and learns to predict whether or not the tracklet corresponds to a meteor. The LSTM achieves a precision of 90.0% and a recall of 89.1%. One key advantage of using Deep Learning is that we did not have to hand-engineer the meaningful features from both images and light curves. The models learned these on their own.
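
For readers less familiar with the two reported metrics, the short Python sketch below shows how precision and recall are computed for a binary meteor / non-meteor classifier. The function name and the eight toy labels are assumptions for illustration; they are not CAMS data.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = meteor, 0 = other)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels: four true meteors, four other objects.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Precision is the fraction of detections flagged as meteors that really are meteors, while recall is the fraction of real meteors that the classifier flags.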
We did a qualitative evaluation by inspecting instances where the network’s predictions are incorrect. False negatives often happen when meteors are very faint and hard to see. False positives tend to occur when there is an object (like a satellite) that looks very similar to a meteor. It takes an average of 1.8 s to perform one forward pass of one image on an off-the-shelf laptop with no GPU. This makes the model very suitable for deployment on site, where the cameras are located. By comparison, a human expert at peak productivity can annotate about one image per second and achieve 99% precision and recall. While we would like to match or surpass human performance, doing so is not a requirement for successful automation. It is not sustainable for an astronomer to perform this process every day. We can tolerate inaccuracies of up to about 10% given that we can 1) remove the drudgery of human annotation and save time through an automated annotation process, and 2) provide a cleaner data set that feeds the downstream process of calculating meteoroids’ orbits. Ultimately our tools free the astronomer from this low-cognitive task and help the search for debris of potentially dangerous long-period comets.
490
Employing Neural Hierarchical Model for Abstractive Text Summarization
Wasifa Chowdhury, Simon Fraser University; Fred Popowich, Simon Fraser University
As the growth of online data in the form of news, social media, email, and text continues, automatic summarization is integral to generating a condensed form that conveys the gist of the original text. While most of the earlier works on automatic summarization use an extractive approach to identify the most important parts of the document, some recent research focuses on the more challenging task of making the summaries more abstractive, requiring effective paraphrasing and generalization steps. In this work, we propose an encoder-decoder attentional recurrent neural network model to achieve automatic abstractive summarization. Although most of the recently proposed methods have already used neural sequence-to-sequence models, two issues still need to be addressed: how to focus on the most important portions of the input when generating the output words, and how to handle the out-of-vocabulary words not contained in the fixed-size target list. Unlike other NLP tasks such as machine translation, which requires encoding all input information to produce the translation, summarization needs to extract only the key information while ignoring the irrelevant portions that might degrade overall summary quality. We use a hierarchical word-to-sentence encoder to jointly learn word and sentence importance using features such as content richness, salience, and position. During decoding, the attention mechanism operates at both sentence and word levels. To address the problem of unknown words, we learn a word-to-character model. We conduct our experiments on the CNN/Daily News corpus and provide both quantitative and qualitative analysis.
497
Supervised learning for survival analysis: Diabetes type II risk screening in the UK primary care
Torgyn Shaikhina, University of Warwick; Margaret Smith, University of Oxford Nuffield Department of Primary Care Health Sciences; Natasha Khovanova, University of Warwick; Mark Leeson, University of Warwick; Alice Fuller, University of Oxford; Claire Bankhead, University of Oxford; Sarah Stevens, University of Oxford; Rafael Perera, University of Oxford Nuffield Department of Primary Care Health Sciences; Tim Holt, University of Oxford Nuffield Department of Primary Care Health Sciences
Survival analysis has received limited attention in the machine learning community, largely due to inherent incompatibility of supervised learning with the censored outcomes inevitably involved in the analysis (Wang et al. 2017). In healthcare, semi-parametric Cox regression and its extensions have remained a dominant technique for longitudinal prognostic modelling since their invention (Cox 1972). Nevertheless, the increasingly mobile population and heterogeneity of the modern electronic health records (EHRs) have motivated the use of machine learning for modelling long-term incidences of multifactorial diseases, such as diabetes.

Presented in this work are the technical aspects of an ongoing clinical and engineering collaboration with an aim to modernise the Diabetes type II risk screening system presently used in the UK primary care. Accurate prognostic models for a 10-year incidence of Diabetes Type II in the UK general population have been developed using ensemble neural network (NN) learning, and auxiliary survival decision trees (DTs). The models were derived and validated with 80,000 EHRs routinely collected across the UK National Health Service general practices. The NN ensemble model demonstrated its competitive performance on censored observations despite not being specifically trained to handle them. It was able to successfully capture the non-linear associations between the physiological indicators and the long-term patient outcomes across the multi-dimensional data. It yielded 85% concordant predictions, notwithstanding the biased covariate estimates introduced by multiple imputation of missing data and a class imbalance of 1:11.
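
The abstract does not say exactly how concordance was computed. One common choice for right-censored data is Harrell's concordance index, sketched below under that assumption. The function name and the four toy patients are hypothetical and serve only to show which pairs count as comparable when some outcomes are censored.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: the share of comparable pairs ranked correctly.

    times       -- follow-up time for each patient
    events      -- 1 if the event was observed, 0 if right-censored
    risk_scores -- model output, where higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only if patient i had an observed event
            # strictly before patient j's event or censoring time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: patients 1 and 3 are censored (event flag 0).
times = [2, 4, 6, 8]
events = [1, 0, 1, 0]
risk_scores = [0.9, 0.3, 0.4, 0.5]
c_index = concordance_index(times, events, risk_scores)
```

In this toy cohort, four pairs are comparable and the model orders three of them correctly, so `c_index` is 0.75. Censored patients still contribute as the longer-surviving member of a pair, which is how the measure makes use of censored observations.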

The survival DT model was able to identify the groups of patients at high 10-year risk of Diabetes Type II and pinpoint those that required additional granularity. 82.4% of the DT’s predictions were concordant. Secondary surrogate splits were deployed so that the DT could handle missing blood glucose information directly, without the need for imputation. Coupled with its easily interpretable structure, the survival DT offered actionable insights for improving the prognostic risk score estimations.
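The concordance figures quoted above can be illustrated with a small sketch. It assumes that "concordant predictions" refers to a Harrell-style concordance index over comparable patient pairs; the abstract does not state the exact definition, and the function name and toy data below are hypothetical.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell-style concordance for right-censored data (assumed definition).

    times  : observed follow-up time per patient
    events : True if the outcome was observed, False if censored
    risks  : predicted risk score (higher means an earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the shorter
        # follow-up ended in an observed event (not in censoring).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: the second patient is censored at t=4.
times = [2, 4, 6, 8]
events = [True, False, True, True]
risks = [0.9, 0.2, 0.3, 0.6]
print(concordance_index(times, events, risks))
```

Pairs whose shorter follow-up is censored are skipped, which is how right-censoring enters the metric without requiring the model itself to handle censored observations.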

This work additionally presents the strategies for coping with broader limitations inherent in routinely-collected primary care data, including heterogeneity of data types (free text, disease codes, and numerical laboratory test results), class imbalance (incidence of the disease in the general population), sparsity (both planned and accidental), and right-censoring. Secondary work in progress involves hybridising the Cox proportional hazards model with a sigmoidal neural network in order to reduce the effect of right-censored data on the model discrimination and calibration. This extension is scalable to other application domains, including customer churn forecasting and predictive equipment maintenance.
498
Image Segmentation to Distinguish Between Overlapping DNA Chromosomes
R. Lily Hu, Salesforce Research; Jeremy Karanowski, Insight; Ross Fadely, Insight; Jean-Patrick Pommier, Jean-Patrick Pommier
Neural networks are a powerful approach to segmenting images, including for street scenes and biomedical images of tissue. In medicine, visualizing chromosomes is important for medical diagnostics, drug development, and biomedical research. Unfortunately, chromosomes often overlap and it is necessary to identify and distinguish between the overlapping chromosomes. For example, some diseases are associated with particular chromosomes or the existence of more or fewer than the expected number of chromosomes. Challenges to this problem include that the overlapping objects may be nearly identical and that it is arbitrary which object is considered the first object and which one the second. Furthermore, overlapping chromosomes may look like one larger chromosome, may criss-cross, or one may be almost entirely on top of the other. A segmentation solution that is fast and automated will enable scaling of cost effective medicine and biomedical research. Traditional methods of distinguishing between overlapping chromosomes involved printing and cutting out individual chromosomes by hand, thresholding on histogram values of pixels, geometric analysis of chromosome contours, among others, and required human intervention when partial overlaps occur.
In this work, we apply neural network-based image segmentation to the problem of distinguishing between partially overlapping DNA chromosomes. A convolutional neural network, based on U-Net, is customized for this problem. The model is designed so that the output segmentation map has the same dimensions as the input image. To reduce computation time and storage, the model is also simplified. This is because the dimensions of the input image, the set of potential objects in the image, and the set of potential chromosome shapes, are all small, which reduces the scope of the problem, the required capacity of the model, and thus the modeling needs. Various hyperparameters of the model are explored and tested.
The model is deployed on a set of grayscale images of overlapping chromosomes. It achieved intersection over union (IOU) scores of 94.7% for the overlapping region and 88-94% on the non-overlapping chromosome regions. Unlike non-deep-learning methods, it achieves these results without human intervention during prediction.
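For readers unfamiliar with the metric, the per-region IOU reported above can be sketched as follows. The label encoding (0 background, 1 and 2 the two chromosomes, 3 the overlap) and the tiny masks are assumptions for illustration only, not the authors' data format.

```python
def iou(pred, truth, label):
    """Intersection over union for one class label in two same-sized masks."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred = p == label
            in_truth = t == label
            intersection += in_pred and in_truth
            union += in_pred or in_truth
    # Undefined when the label appears in neither mask.
    return intersection / union if union else float("nan")


# Hypothetical 3x4 masks; 3 marks the overlapping region.
truth = [[0, 1, 1, 0],
         [0, 1, 3, 2],
         [0, 0, 2, 2]]
pred = [[0, 1, 1, 0],
        [0, 1, 3, 3],
        [0, 0, 2, 2]]

for label in (1, 2, 3):
    print(label, iou(pred, truth, label))
```

Scoring each label separately is what allows the overlapping region to be reported apart from the non-overlapping chromosome regions.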
504
Detecting Behavioral Engagement of Students in the Wild Based on Contextual and Visual Data
Eda, Okur, Intel Labs; Nese, Alyuz, Intel Labs; Sinem, Aslan, Intel Labs; Utku, Genc, Intel Labs; Cagri, Tanriover, Intel Labs; Asli, Arslan Esme, Intel Labs
Abstract – To investigate detection of students’ behavioral engagement (On-Task vs. Off-Task), we propose a two-phased approach: In Phase 1, contextual logs (URLs) are utilized to assess active usage of the content platform. If there is active use, the appearance information is utilized in Phase 2 to infer behavioral engagement. Incorporating the contextual information improved the overall F1-scores from 0.77 to 0.82.

Our goal is to detect students’ behavioral engagement (On-Task vs. Off-Task states) in 1:1 digital learning scenarios. In this study, we have two research questions: (1) What level of behavioral engagement detection performance can we achieve by using a scalable multi-modal approach (i.e., camera and URL logs)? (2) How would this performance change when considering cross-subjects or cross-content platforms (Math vs. English as a Second Language (ESL))?

Monitoring students’ face and upper body (appearance) as well as their interactions with the learning platform (context) provides important cues to accurately understand different dimensions of students’ states during learning. To detect behavioral engagement, we propose a two-phased system: (1) Phase 1: Contextual data (URL logs) is processed to assess whether the student is actively using the content platform. If not (Off-Platform), the student’s state is predicted as Off-Task. (2) Phase 2: If the content platform is active on the learner’s device, then the appearance information is utilized to predict whether the student is On-Task or Off-Task. We trained generic appearance classifiers for Phase 2 (Random Forests). The frame-wise raw video data is used to extract face location, head position and pose, 78 facial landmark localizations, 22 facial expressions, and 7 basic facial emotions. For instance-wise feature extraction, conventional time series analysis methods were applied, such as robust statistical estimators, motion and energy measures, and frequency domain features. Instances are 8-second sliding windows with 4-second overlaps.
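The two-phased decision and the windowing described above can be sketched as follows. The frame rate, the function names, and the stand-in classifier are hypothetical; the abstract states only that Phase 2 uses Random Forests on appearance features.

```python
def sliding_windows(n_frames, fps, window_sec=8, overlap_sec=4):
    """Return (start, end) frame indices of fixed-length overlapping windows."""
    size = window_sec * fps
    step = (window_sec - overlap_sec) * fps
    return [(start, start + size)
            for start in range(0, n_frames - size + 1, step)]


def predict_engagement(on_platform, appearance_features, appearance_classifier):
    """Phase 1 gates on context; Phase 2 defers to the appearance model."""
    if not on_platform:
        # Off-Platform instances are labelled Off-Task without using video.
        return "Off-Task"
    return appearance_classifier(appearance_features)


# Hypothetical 20-second clip sampled at 1 frame per second.
print(sliding_windows(n_frames=20, fps=1))

# Stand-in for a trained classifier (assumption, not the authors' model).
def always_on_task(features):
    return "On-Task"

print(predict_engagement(False, None, always_on_task))
print(predict_engagement(True, {"head_pose": 0.1}, always_on_task))
```

Gating on the URL logs first means the appearance classifier only ever runs on instances where the content platform is active.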

170 hours of multi-modal data were collected through authentic classroom pilots, from 28 ninth-grade students (two different classrooms) in 22 sessions (40 minutes each), using laptops with a 3D camera. Online content platforms for two subject areas were used: (1) Math (watching videos), (2) ESL (reading articles). To obtain ground truth labels, we employed HELP [1] with 3 expert labelers. We experimented with two test cases: (1) Cross-classroom, where trained models were tested on a different classroom’s data; (2) Cross-platform, where the data collected in different subject areas were utilized in training and testing, respectively. The results for these two experiments are summarized in Tables 1 and 2, respectively.

Table 1. F1-scores (%) for cross-classroom experiment (Set1: Classroom1, Set2: Classroom2, Appr: Appearance).
Train / Test / Class / Appr / Context+Appr
Set1 / Set1 / On-Task / 82 / 82
Set1 / Set1 / Off-Task / 69 / 77
Set1 / Set1 / Overall / 77 / 80
Set1 / Set2 / On-Task / 83 / 83
Set1 / Set2 / Off-Task / 63 / 79
Set1 / Set2 / Overall / 77 / 82

Table 2. F1-scores (%) for cross-platform experiment (Set1: Classroom1 with Math, Set2: Classroom2 with Math, Set3: Classroom1 with ESL).
Train / Test / Class / Appr / Context+Appr
Set1+Set2 (Math) / Set1+Set2 (Math) / On-Task / 82 / 82
Set1+Set2 (Math) / Set1+Set2 (Math) / Off-Task / 67 / 78
Set1+Set2 (Math) / Set1+Set2 (Math) / Overall / 77 / 80
Set1+Set2 (Math) / Set3 (ESL) / On-Task / 79 / N/A
Set1+Set2 (Math) / Set3 (ESL) / Off-Task / 59 / N/A
Set1+Set2 (Math) / Set3 (ESL) / Overall / 72 / N/A

Because Set2 contains more Off-Platform samples than Set1, and these samples are predicted as Off-Task in Phase 1, using context improves Off-Task scores more in Set2. We believe that the overall performance achieved is acceptable, as the expected accuracy by chance is 0.48, observed accuracy is 0.77, and Cohen’s Kappa is 0.55 for the final models.
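As a quick check of how the three figures above relate, Cohen's kappa can be computed from the observed and chance-expected accuracies. The sketch uses the rounded values quoted in the text, so it gives roughly 0.56 rather than exactly the reported 0.55; the small gap is consistent with rounding of the inputs.

```python
def cohens_kappa(observed_accuracy, expected_accuracy):
    """Agreement beyond chance, scaled by the maximum possible improvement."""
    return (observed_accuracy - expected_accuracy) / (1.0 - expected_accuracy)


# Rounded values as quoted in the abstract.
kappa = cohens_kappa(observed_accuracy=0.77, expected_accuracy=0.48)
print(round(kappa, 3))
```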

To explore a scalable multi-modal approach for behavioral engagement detection, we proposed a two-phased system incorporating both visual and contextual cues. Using the context information, even in the form of URL logs, improves the overall system performance. The promising overall F1-scores show the cross-subject and cross-platform applicability of our models.

[1] S. Aslan, S. E. Mete, E. Okur, E. Oktay, N. Alyuz, U. Genc, D. Stanhill, A. Arslan Esme. 2017. Human expert labeling process (HELP): towards a reliable higher order user state labeling process and tool to assess student engagement. Educational Technology, 57(1), 53-59.
509
End-to-End Trained CNN Encoder-Decoder Networks For Image Steganography
Atique ur Rehman, National University of Computer & Emerging Sciences; Rafia Rahim, National University of Computer & Emerging Sciences; Shahroz Nadeem, National University of Computer & Emerging Sciences; Sibt ul Hussain, National University of Computer & Emerging Sciences
All the existing image steganography methods use manually crafted features to hide binary payloads into cover images. This leads to small payload capacity and image distortion. Here we propose a convolutional neural network based encoder-decoder architecture for embedding of images as payload. To this end, we make the following three major contributions: (i) we propose a deep learning based generic encoder-decoder architecture for image steganography; (ii) we introduce a new loss function that ensures joint end-to-end training of encoder-decoder networks; (iii) we perform extensive empirical evaluation of the proposed architecture on a range of challenging publicly available datasets (MNIST, CIFAR10, PASCAL-VOC12, ImageNet, LFW) and report state-of-the-art payload capacity at high PSNR and SSIM
Infinite-Stage Dynamic Treatment Regimes under Constraints
Shuping Ruan
In precision medicine research, dynamic treatment regimes (DTRs) are sequential decision making problems for chronic conditions. Most of the current methods for constructing dynamic treatment regimes focus on optimizing a single utility function over a finite number of decision time points (finite horizon). However, clinical situations often, in practice, require considering the trade-off among multiple competing outcomes without a priori fixed end of follow-up point (infinite horizon). Hence, we develop a method of estimating constrained optimal dynamic treatment regimes in chronic diseases where patients are monitored and treated throughout their life. We apply our method to a simulated cancer trial dataset based on a chemotherapy mathematical model, and examine the results of our proposed method.
Towards Taking a Strategy in Training Generative Adversarial Networks
Tatjana Chavdarova, Idiap Research Institute and École Polytechnique Fédérale de Lausanne; François Fleuret, Idiap Research Institute and École Polytechnique Fédérale de Lausanne
Generative Adversarial Networks (GANs) are a novel unsupervised generative learning algorithm. Contrary to traditional generative models, this algorithm bypasses explicit density modeling and leverages powerful deep learning architectures. GANs have found numerous applications in computer vision. On the other hand, they have gained a reputation for being difficult to train.

We consider an alternative training scheme, named SGAN, in which several "local" adversarial pairs of networks are trained independently so that a "global" pair of networks can be trained using these. The goal in SGAN is to train the global networks against the corresponding ensemble opponent, for improved mode coverage. This approach aims at increasing the chances that learning will not stop for the global pair, preventing both networks from being trapped in an unsatisfactory local minimum or from facing the oscillations often observed in practice. To guarantee the latter, the global pair never affects the local ones.

The rules of SGAN training are as follows: the global generator and discriminator are trained using the local discriminators and generators, respectively, whereas the local networks are trained with their fixed local opponent.

A thorough experimental evaluation indicates that this kind of training systematically improves upon classical training. In particular, it improves mode coverage, stability, and surprisingly, convergence speed.
Generating Contextual Descriptions of Virtual Reality (VR) Spaces
Danielle Olson, Massachusetts Institute of Technology; Cagri Zakan Haman, Massachusetts Institute of Technology; Ainsley Sutherland, Massachusetts Institute of Technology
Virtual reality holds great potential for science communication, education, and research. However, interfaces for manipulating data and environments in virtual worlds are limited and idiosyncratic. Furthermore, speech and vision are the primary modalities by which humans collect information about the world, but the linking of visual and natural language domains is a relatively new pursuit in computer vision. Machine learning techniques have been shown to be effective at image and speech classification, as well as at describing images with language (Karpathy 2016), but have not yet been used to describe potential actions.
We propose a technique for creating a library of possible context-specific actions associated with 3D objects in immersive virtual worlds based on a novel dataset generated natively in virtual reality containing speech, image, gaze, and acceleration data. We will discuss the design and execution of a user study in virtual reality that enabled the collection and the development of this dataset. We will also discuss the development of a hybrid machine learning algorithm linking vision data with environmental affordances in natural language.
Robust Gaussian Mixture Models for Anomaly Detection in Time-series IoT Datasets
Shruti Bhargava, UIUC; Purushottam Kar, IIT Kanpur; Nagarajan Natarajan, MSR Bangalore; Praneeth Netrapalli, MSR Bangalore
IoT devices have become very popular in varied domains, ranging from border surveillance to patient observatories. However, most of these devices are currently capable only of recording data, due to their limited storage and computation power. Current machine learning methods for making inferences and predictions from data, including the latest deep networks, are so memory intensive that they are of no use unless the data reaches the cloud. Communicating data to the cloud is limited by connectivity requirements and communication bandwidth and, most importantly, excludes the possibility of immediate decision making. Onboard computation, even with simple processing power, could prove highly beneficial and result in a multi-fold increase in the utilisation of IoT. This necessitates reverting to simpler low-memory models capable of operating in constrained settings and innovating to improve their performance.

With the increasing presence of IoT devices in security, health domains and human activity tracking, an interesting challenge is to empower them to aid in identifying and preventing unusual situations. For instance, data from human activity patterns can be used to instantaneously identify sickly behaviour or detrimental characteristics and warn a person. Drivers’ behaviour can be studied to predict and warn against possible accidents. These are the key motivators for our work. Since the data from IoT devices is in the form of time-series, it contains redundant dependent data and needs to be processed to extract relevant features. We aim to achieve two major objectives: 1) to represent time-series data as a time-independent feature vector, and 2) to develop a noise-robust unsupervised anomaly detector for IoT devices. To meet the first objective, we look at a combination of feature representations, drawing from statistics. We extract meaningful features combining central tendencies to summarize a window as one observation. Empirical CDF estimation with heavy-tail sampling is chosen to capture the extreme variations in variables. Fourier-transformed features in the frequency domain preserve the fluctuations in our window. These features are re-evaluated using benchmark algorithms for clustering and classification. For the second objective, we had to face other hindrances: defining anomalies is quite subjective, as it is a relative term, varying with application, and can change considerably with time. Working with IoT devices, we looked at data compression techniques for easier data transfer and processing. We select GMMs as our model for anomaly detection owing to their low memory requirements and test-time processing. We redesign the distribution learning technique of GMMs to efficiently identify anomalies in the data. We propose an iterative elimination algorithm to incorporate robustness in our model, i.e. force it to improve its estimate of the normal data and identify the anomalies. In this algorithm, convergence is achieved through the active set convergence analysis involved in Gaussian Mixture modelling as well as decreasing cutoff fractions. We assess the algorithm using both synthetic and real-life datasets, including class-imbalanced binary datasets and IoT time-series datasets. With very few iterations, we are able to achieve performance very close to the baseline and beat the baseline for some datasets.
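The general idea of iterative elimination can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm: a single one-dimensional Gaussian stands in for the GMM, the cutoff schedule is hypothetical, and the interpretation of "decreasing cutoff fractions" as a shrinking share of points discarded per round is an assumption.

```python
import math


def fit_gaussian(values):
    """Maximum-likelihood mean and variance of a 1-D sample."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, 1e-12)  # floor avoids division by zero


def log_density(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def iterative_elimination(values, cutoffs=(0.3, 0.2)):
    """Refit on the retained points, then re-rank ALL points each round.

    cutoffs: decreasing fractions of points treated as anomalous per round
             (hypothetical schedule).
    Returns the indices flagged in the last round and the final (mean, var).
    """
    n = len(values)
    kept = list(range(n))
    dropped = []
    for frac in cutoffs:
        mean, var = fit_gaussian([values[i] for i in kept])
        # Lowest density first; points dropped earlier may re-enter.
        ranked = sorted(range(n),
                        key=lambda i: log_density(values[i], mean, var))
        n_drop = round(frac * n)
        dropped = sorted(ranked[:n_drop])
        kept = sorted(ranked[n_drop:])
    return dropped, fit_gaussian([values[i] for i in kept])


# Hypothetical readings: eight normal values and two gross outliers.
readings = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.02, 0.98, 10.0, -8.0]
anomalies, (mean, var) = iterative_elimination(readings)
print(anomalies, mean, var)
```

Each refit excludes the current suspects, so the estimate of the normal data tightens from round to round; in the toy data the variance falls from about 16 on the full sample to well under 0.01 once the two outliers are excluded.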
I Know That Person: Generative Full Body and Face De-Identification of People in Images
Karla Brkic, University of Zagreb; Ivan Sikiric, Mireo; Tomislav Hrkac, University of Zagreb; Zoran Kalafatic, University of Zagreb
Nowadays, cameras are everywhere. We are used to being watched and photographed, and we commonly see our images online without having given our explicit consent, e.g. in Google Street View, other people's YouTube videos, etc. Simultaneously, advances in computer vision and machine learning make it increasingly easy to automatically process such data and extract potentially privacy-sensitive information. Common attempts to protect the privacy of people in images, such as simple face blurring, pixelization, etc., do very little to de-identify (i.e. obfuscate and/or hide) revealing soft biometric and non-biometric identifiers, including specifically colored and textured clothing, characteristic hairstyles and personal items, skin marks and tattoos, etc. A more complex method that produces natural-looking, fully de-identified images is needed.

In this work, we introduce a model for full body and face de-identification of humans in images. Assuming the silhouette of the person is known, we synthesize an alternative appearance that fits the silhouette and can therefore be seamlessly integrated into the original image. Our model is capable of generating outputs at varying levels of detail, depending on the requirements and the available segmentation of the human figure.
The proposed model employs two generative adversarial networks (GANs) (Goodfellow et al., 2014). The first network, used for full body de-identification, is a conditional GAN that generates synthetic human images. The conditional GAN is trained on pairs of human segmentations and human images with the goal of outputting realistic-looking synthetic human images conditioned on the extracted segmentation. The second network is a dedicated face synthesis GAN, specialized to ensure that the rendered face looks realistic and has an adequate level of detail.

Experimental evaluation is performed on the Clothing Co-Parsing (CCP) dataset (Yang et al., 2014) and the Human3.6M dataset (Ionescu et al., 2014). We perform a perceptual study where users evaluate the similarity of original and de-identified images and rate naturalness and recognizability of the subject and a re-identification experiment in which we measure the top-k retrieval performance given a de-identified image as query. Perceptual and re-identification experiments show that our model generates images that look natural, while offering a strong level of identity protection. In general, the more detailed the segmentation input, the higher the naturalness of the de-identified output. Our model is applicable across datasets, and output naturalness can be improved using a naive segmentation strategy. The proposed model represents a significant step forward in thwarting human and machine recognition of de-identified images, while preserving data utility and naturalness.
Counterfactual regularization: Mitigating the effect of confounders
Christina Heinze-Deml, ETH Zurich; Nicolai Meinshausen, ETH Zurich
When training a deep network for image classification, one can broadly distinguish between two types of latent features of images that will drive the classification: (i) "immutable" or "core" features that are inherent to the object in question and do not change substantially from one instance of the object to another and (ii) "mutable" or "style" features such as position, rotation or image quality but also more complex ones like hair color or posture for images of persons. The distribution of the style features can change in the future. While transfer learning would try to adapt to a shift in the distribution(s), we here want to protect against future adversarial domain shifts, arising through changing style features, by ideally not using the mutable style features altogether.
There are two broad scenarios and we show how exploiting grouping information in the data helps in both. (a) If the style features are known explicitly (e.g. rotation) one usually proceeds by using data augmentation. By exploiting the grouping information about which original image an augmented sample belongs to, we can reduce the sample size required to achieve invariance to the style feature in question. (b) Sometimes the style features are not known explicitly but we still have information about samples that belong to the same underlying object (such as different pictures of the same person). By constraining the classification to give the same forecast for all instances that belong to the same object, we show how using this grouping information leads to invariance to such implicit style features and helps to protect against adversarial domain shifts.
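The grouping idea in scenario (b) can be sketched as a penalty on how much predictions vary within a group of instances of the same object. This is an illustrative relaxation of the "same forecast" constraint, offered as an assumption; the function name, the choice of variance as the penalty, and the toy values are hypothetical.

```python
from collections import defaultdict


def within_group_variance(predictions, group_ids):
    """Average, over groups, of the variance of predictions inside each group.

    A value of 0 means every instance of the same object gets the same
    forecast, which is the constraint described in the text.
    """
    groups = defaultdict(list)
    for prediction, group in zip(predictions, group_ids):
        groups[group].append(prediction)
    total = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values) / len(values)
    return total / len(groups)


# Hypothetical class scores for two pictures each of persons "a" and "b".
scores = [0.9, 0.7, 0.2, 0.2]
person = ["a", "a", "b", "b"]
print(within_group_variance(scores, person))
```

In training, such a term would typically be added to the classification loss with a tunable weight, so that features which differ only between pictures of the same object are discouraged from driving the forecast.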
We provide a causal framework for the problem and treat groups of instances of the same object as counterfactuals under different interventions on the mutable style features. We show links to questions of fairness, transfer learning and adversarial examples.
Detecting Community Structures in Hierarchies
Phuc Nguyen, Macalester College; Daniel Larremore, University of Colorado Boulder; Caterina De Bacco, Santa Fe Institute; Cris Moore, Santa Fe Institute
Hidden hierarchies exist in many natural and artificial systems (e.g. animal groups, university hiring networks) and provide insights into the structures and dynamics of these systems. Such hierarchies might contain additional structures like tiers or overlapping sub-hierarchies. However, ranking methods that return ordinal ranks tend to overlook these structures.

We propose a technique to detect community structures in hierarchies using a new physics-based ranking model. This model assumes that similarly ranked vertices interact and edge directions imply the rankings. It then assigns real-valued scores to vertices. Applying clustering algorithms such as k-means to these scores can reveal tiered community structures, though not overlapping sub-hierarchies.
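The clustering step can be illustrated with a minimal one-dimensional k-means on real-valued scores. The scores below are hypothetical and are not produced by the ranking model; the deterministic initialisation is a convenience chosen for this sketch.

```python
def kmeans_1d(scores, k, n_iter=100):
    """Lloyd's algorithm on scalars; returns (labels, centers)."""
    ordered = sorted(scores)
    # Deterministic start: k evenly spaced order statistics.
    centers = [ordered[(2 * c + 1) * len(ordered) // (2 * k)]
               for c in range(k)]
    labels = []
    for _ in range(n_iter):
        # Assign every score to its nearest center.
        labels = [min(range(k), key=lambda c: abs(x - centers[c]))
                  for x in scores]
        new_centers = []
        for c in range(k):
            members = [x for x, lab in zip(scores, labels) if lab == c]
            new_centers.append(sum(members) / len(members)
                               if members else centers[c])
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return labels, centers


# Hypothetical vertex scores forming three well-separated tiers.
scores = [0.1, 0.2, 0.15, 2.0, 2.1, 1.9, 4.0, 4.2, 3.9]
labels, centers = kmeans_1d(scores, k=3)
print(labels)
print(centers)
```

Vertices sharing a label form one tier; with noisier scores the tiers blur together, which is the regime where the text reports that clustering correlations works better than clustering raw scores.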

We show that these scores have a multivariate normal distribution and derive their correlation matrix. Using synthetic data, we demonstrate that applying k-means to the correlations between scores outperforms clustering of raw scores in identifying tiers in a hierarchy, especially as the data become noisier. In addition, we show that clusters of correlations between scores can reveal overlapping sub-hierarchies. The ranking method’s generative model provides a natural null model for this type of community structure. We utilize this null model to further improve our technique’s performance when overlapped regions are smaller, though at an increase in computational cost. Finally, we test our technique’s performance on real network data.
From Restaurants to Campus: Exploring Students’ Check-in Behavior
Mengyue Hang, Purdue University; Jennifer Neville, Purdue University
Millions of check-in records in location-based social networks (LBSNs) provide an opportunity to study users’ mobility patterns and social behavior from a spatial-temporal perspective. In recent years, the point-of-interest (POI) recommendation problem has attracted significant attention. In POI tasks, the goal is to analyze users’ past history of activity and then make recommendations based on their current context (including spatial, temporal, and contextual information). However, recent work on developing POI methods has been conducted solely on voluntary check-in datasets collected from Foursquare or Yelp. While these well-known datasets contain rich information about food, nightlife, and entertainment, due to the nature of the applications, there is a lack of diversity in the category of activities, which prevents machine learning methods from reaching a deeper understanding of users’ daily routines.
Moreover, relying on user-created content (reported voluntarily) can bias analysis when we study mobility patterns or personal preferences, since there is a large quantity of unreported visits. While GPS tracking can provide more extensive information about users’ movements, it does not contain the rich venue information that check-in data provides. Here, we present the first analysis of a spatial-temporal educational “check-in” dataset, which records (anonymized) users’ access to WiFi access points on campus, with venue information about locations (e.g., dining hall, library, dorm, classroom). Specifically, we analyze WiFi log-in history for 3000 students across 164 buildings over the first five-week period of a Fall semester. Compared to well-known check-in datasets like Foursquare, these data contain (1) more active users, (2) a richer set of activities (e.g., study, dine, exercise, rest), and (3) a well-annotated activity range (i.e., on campus). These characteristics make it easier to analyze the unique properties of user check-in data and extract interesting social and mobility patterns.
We aim to capture students’ behavioral patterns by formulating a time-aware location prediction problem. Given a user and time slot (e.g. Monday 8 am), the model should predict the place most likely to be visited. To better leverage contextual information, we propose a joint embedding model which maps user, location, time and activity category into a common latent space. We first generate a heterogeneous graph using the check-in records (see Figure 1). Then we learn continuous feature representations for nodes by capturing features of connectivity and structural similarity for pairs of nodes. For instance, even if there is no edge between two user nodes, we can still model their similarity in the distribution of edge-weights over other “contextual” nodes (i.e. location node, etc.). Once we have learnt representations for users, time slots and locations, we can perform location prediction on new check-in data with simple operations on vectors.
In our experiments, we took the first 80% of each student’s check-in records, in chronological order, as training examples and used the remainder as the test set. Our model achieves prediction accuracy@10 of 88% and accuracy@5 of 79%, which significantly outperforms competing methods. When we consider only out-of-class check-ins, model accuracy is even higher, which indicates that students change their class registration in the first several weeks of a new semester, but retain a more predictable schedule for out-of-class activities. Similar models achieve accuracy@10 of only 44% on Foursquare data, which indicates the presence of more meaningful spatial-temporal patterns worth investigating in the educational check-in data.
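The evaluation protocol above can be sketched in a few lines. The record layout (timestamp, location), the function names, and the toy predictions are assumptions for illustration; they are not taken from the authors' code.

```python
def chronological_split(records, train_fraction=0.8):
    """Split one student's (timestamp, location) records by time order."""
    ordered = sorted(records, key=lambda record: record[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def accuracy_at_k(ranked_predictions, true_locations, k):
    """Share of test check-ins whose true location is in the top-k list."""
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_predictions, true_locations))
    return hits / len(true_locations)


# Hypothetical check-ins for one student.
records = [(3, "library"), (1, "dorm"), (2, "dining hall"),
           (5, "gym"), (4, "classroom")]
train, test = chronological_split(records)
print(train)
print(test)

# Hypothetical ranked predictions for two test check-ins.
ranked = [["library", "dorm", "gym"], ["dorm", "library", "gym"]]
truth = ["dorm", "gym"]
print(accuracy_at_k(ranked, truth, k=2))
print(accuracy_at_k(ranked, truth, k=3))
```

Splitting by time rather than at random keeps future check-ins out of the training set, which matters for a task defined as predicting where a student will go next.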
Learning the probability of activation in the presence of latent spreaders
Maggie, Makar, MIT; John, Guttag, MIT; Jenna, Wiens, University of Michigan
When an infection spreads in a community, an individual's probability of becoming infected depends on both her susceptibility and exposure to the contagion through contact with others. While one often has knowledge regarding an individual's susceptibility, in many cases, whether or not an individual's contacts are contagious is unknown. We study the problem of predicting if an individual will adopt a contagion in the presence of multiple modes of infection (exposure/susceptibility) and latent neighbor influence. We present a generative probabilistic model and a variational inference method to learn the parameters of our model. Through a series of experiments on synthetic data, we measure the ability of the proposed model to identify latent spreaders, and predict the risk of infection. Applied to a real dataset of 20,000 hospital patients, we demonstrate the utility of our model in predicting the onset of a healthcare associated infection using patient room-sharing and nurse-sharing networks. Our model outperforms existing benchmarks and provides actionable insights for the design and implementation of targeted interventions to curb the spread of infection.
Improving EEG-based Brain-Computer Interfaces with User Response to Feedback
Mahta Mousavi, UC San Diego; Virginia de Sa, UC San Diego
EEG-based brain computer interface (BCI) systems collect and infer neural information from the brain through electroencephalography (EEG) without using the usual neuromuscular pathways. These systems are known to have numerous applications for both healthy and patient populations [Nicholas Alonso and Gomez-Gil, Sensors, 2012]. Motor imagery (MI) is a common BCI paradigm where a user imagines moving a part of her/his body (such as right or left hand, feet or the tongue) without actually moving it. The power of the signals from different spatial filters in different frequency bands is then used as the feature set to distinguish the user’s intended movement. These features can be translated into control commands, e.g. moving a cursor on the screen towards various targets, which can be mapped to different tasks and enable the user to interact with the world.

One challenge in EEG-based BCIs corresponds to the limited information that can be collected non-invasively as the skull and scalp act as a low-pass filter. Moreover, while the user is performing a specific task – motor imagery in this case – other brain processes are also active and add to the recorded signal. One interfering process corresponds to the user’s perception of the BCI performance: for instance, whether the cursor on the screen is moving towards the target or not. This perception can induce emotional states such as satisfaction (happiness) if the cursor is moving in the expected (imagined) direction or dissatisfaction (frustration) otherwise. Instead of ignoring or treating this signal as a source of noise, we recognize it as an important means to infer motor imagery intent. Our goal is to study optimal ways to identify both motor imagery and satisfaction/dissatisfaction information in a task, and to train classifiers that combine this information for an improved control of the BCI.

In a recent study [Mousavi et al., BCI Journal, 2017], we collected data from 10 healthy participants in a right/left motor imagery task with feedback presented as a cursor on the screen moving to the right or left. The provided feedback was random and independent of the motor imagery signal of the participant so that the collected data contained enough instances of ‘towards’ and ‘away’ (with respect to the target) cursor movements for every participant. However, the participants believed they were controlling the cursor. We trained two linear discriminant analysis (LDA) classifiers: a conventional right/left hand motor imagery classifier, and another classifying the participant’s satisfaction with BCI performance. Our results show that right/left imagery and satisfaction/dissatisfaction classifiers both perform above chance, and the correlation between the two is not significant in most participants. We applied logistic regression to combine scores from the two classifiers and present an improved BCI performance with an average improvement of 11% and up to 22% in accuracy, on a per-participant basis. We also replaced the LDAs with neural networks comprising one hidden layer with 10 units for both right/left and satisfaction/dissatisfaction classifiers and found similar results.

Currently, we are designing a neural network to more flexibly determine the best way to combine the motor imagery and satisfaction/dissatisfaction information through the idea of multi-task learning [Caruana, Machine learning, 1997]. Combining the information from the two sources is not straightforward: when the cursor moves to the right, satisfaction/dissatisfaction maps directly to right/left; when the cursor moves to the left, however, satisfaction/dissatisfaction maps to left/right instead. The ‘observed direction of movement’ is the piece of information that enables the multi-task network to learn this mapping. Moreover, previous cursor movements may also affect the current state of the brain; the brain signal after observing multiple previous cursor movements could therefore be another source of information for the multi-task network.
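The direction-dependent mapping described above can be written out explicitly. The sketch below is only an illustration of that mapping as a fixed rule (the function names and the probability form are assumptions); the network in the abstract is meant to learn it rather than have it hard-coded.

```python
def intended_direction(observed_direction, satisfied):
    """Infer the imagined direction from the observed cursor step and the decoded satisfaction."""
    if observed_direction not in ("left", "right"):
        raise ValueError("observed_direction must be 'left' or 'right'")
    if satisfied:
        # Satisfaction means the cursor moved the way the user imagined.
        return observed_direction
    # Dissatisfaction means the user imagined the opposite direction.
    return "left" if observed_direction == "right" else "right"

def prob_right(observed_direction, p_satisfied):
    """Soft version: probability that the intent is 'right' given P(satisfied)."""
    return p_satisfied if observed_direction == "right" else 1.0 - p_satisfied

for direction in ("right", "left"):
    for satisfied in (True, False):
        print(direction, satisfied, "->", intended_direction(direction, satisfied))
```

The same satisfaction output therefore supports opposite conclusions depending on the observed direction, which is why that direction has to be an input to any model combining the two signals.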
Training Fairer Classifiers: Problems, Prescience, and Proxies
Maya R. Gupta, Google; Andrew Cotter, Google; Mahdi Milani Fard, Google; Serena Wang, Google
We consider the problem of improving metrics defined on groups when training classifiers, often referred to as fairness goals. We catalog the major non-adversarial challenges we have identified in training for fairness goals, as awareness of these issues can help improve fairness metrics. One major practical issue is what we term the prescient fairness problem: one often lacks a dataset labeled with the protected groups, whether because of data collection or privacy reasons or because of a failure to imagine future fairness concerns. To address this, we experimentally investigate training classifiers to improve a fairness metric for proxy groups, in the hope that doing so improves that fairness metric on the true protected groups. Results on benchmark and real-world datasets demonstrate that such a proxy fairness strategy can work well. However, we caution that the effectiveness of such proxy fairness likely depends strongly on the choice of fairness metric, as well as on how well aligned the proxy groups are with the true protected groups. Lastly, we warn that overfitting is a concern with any method that incorporates fairness goals into training.
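To make the proxy-group idea concrete, the sketch below evaluates one possible group metric, the gap between the best and worst per-group accuracy, once with true group labels and once with proxy labels. The metric, the toy labels, and the function names are all assumptions for illustration; the abstract does not specify which fairness metrics were used, and this sketch shows only evaluation, not training.

```python
from collections import defaultdict

def per_group_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately for each group label."""
    hits, counts = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        hits[g] += int(t == p)
        counts[g] += 1
    return {g: hits[g] / counts[g] for g in counts}

def accuracy_gap(y_true, y_pred, groups):
    """Illustrative fairness metric: best group accuracy minus worst group accuracy."""
    acc = per_group_accuracy(y_true, y_pred, groups)
    return max(acc.values()) - min(acc.values())

# Toy data (hypothetical): eight examples with true and proxy group memberships.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
true_groups = ["a", "a", "b", "a", "b", "b", "a", "a"]
proxy_groups = ["p", "p", "q", "p", "q", "q", "q", "p"]

gap_true = accuracy_gap(y_true, y_pred, true_groups)
gap_proxy = accuracy_gap(y_true, y_pred, proxy_groups)
print(gap_true, gap_proxy)
```

In this toy case the proxy groups agree with the true groups on most examples, so the gap measured on the proxies moves in the same direction as the gap on the true groups; the less the two groupings overlap, the weaker that link becomes.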
Automating Cervical Cancer Diagnosis in Low Resource Settings Using Image Processing and Supervised Machine Learning Techniques.
Mercy Asiedu, Duke University Department of Biomedical Engineering; Anish Simhal, Duke University Department of Electrical and Computer Engineering; Guillermo Sapiro, Duke University Department of Electrical and Computer Engineering; Nimmi Ramanujam, Duke University Department of Biomedical Engineering
The World Health Organization recommends visual inspection with acetic acid (VIA) and/or Lugol’s Iodine (VILI) contrast agents for cervical cancer screening in low-resource settings. Interpretation of diagnostic indicators for visual inspection is qualitative and subjective, and it shows high inter-observer discordance that depends on physician or provider experience. In low-resource settings, expert physician interpretation may not be available, and misdiagnosis could lead to undertreatment or overtreatment, with adverse outcomes for the patient. An automated diagnosis based on objectively quantifiable cervical image (cervigram) features would thus be highly valued, particularly in low-resource contexts that lack trained medical personnel. In this work, we propose a simple method for automatic feature extraction and support vector machine-based classification of Lugol’s Iodine cervigrams acquired with a low-cost, miniature, digital colposcope. Physician- and pathology-labelled cervigrams obtained from an institutional review board (IRB) approved clinical study were divided into a training set (70%) and a testing set (30%). We developed algorithms to pre-process the cervigrams to remove clinically irrelevant artifacts and to extract simple color-based features. From these features, a subset of optimal features was selected using sequential feature selection and cross-validation on the training set. These features were used to train a semi-supervised support vector machine model to classify cervigrams as pre-cancer negative or positive. The model was then validated on the testing set. Classification outcomes were compared to expert physician interpretation and to pathology, the latter using data not available to the proposed framework. The proposed algorithms achieved a sensitivity of 81.8% and a specificity of 72.2%, roughly equivalent to the average physician interpretation (sensitivity=78.4%, specificity=76.4%). The area under the curve was 80.8%, with an overall accuracy of 77.5%. The results suggest that utilizing simple color-based features may enable unbiased automation of VILI cervigrams, opening the door to a full system of low-cost data acquisition complemented with automatic interpretation.
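For readers less familiar with the reported metrics, the sketch below shows how sensitivity, specificity, and accuracy follow from confusion-matrix counts. The counts are hypothetical: the abstract does not report a confusion matrix, and these values were chosen only because they happen to yield the reported rates.

```python
def screening_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of true positives that were detected
    specificity = tn / (tn + fp)   # fraction of true negatives that were cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Hypothetical counts for a 40-image test set (not taken from the study).
sens, spec, acc = screening_metrics(tp=18, fn=4, tn=13, fp=5)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} accuracy={acc:.1%}")
```

Sensitivity and specificity are computed on disjoint subsets of the test set (positives and negatives respectively), so overall accuracy is their average weighted by class prevalence rather than a simple mean.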
Luoluo Liu, Peter Chin, Trac Tran
Classical signal recovery based on L1 minimization solves the least squares problem with all available measurements via sparsity-promoting regularization. In practice, it is often the case that not all measurements are available or required for recovery. Measurements might be corrupted or missing, or they might arrive sequentially in a streaming fashion. In this paper, we propose a global sparse recovery strategy based on subsets of measurements, named JOBS, in which multiple measurement vectors are generated from the original pool of measurements via bootstrapping and then a joint-sparse constraint is enforced to ensure support consistency among the multiple predictors. The final estimate is obtained by averaging over the k predictors. The performance limits associated with different choices of the number of bootstrap samples L and the number of estimates k are analyzed theoretically. Simulation results show that the proposed method yields state-of-the-art recovery performance, outperforming L1 minimization and a few other existing bagging-based techniques in the challenging case of low levels of measurements, and is preferable to other bagging-based methods in the streaming setting.
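The bootstrap, joint-sparsity, and averaging steps can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the joint-sparse constraint is replaced by a row-sparsity penalty (sum of row-wise L2 norms), the solver is plain proximal gradient descent, and the function name, the penalty weight lam, and the synthetic problem are all hypothetical. It assumes NumPy is available.

```python
import numpy as np

def jobs_sketch(A, y, boot_size, k, lam=0.05, n_iter=2000, seed=0):
    """Bootstrap k measurement subsets, solve them jointly with a row-sparsity
    penalty via proximal gradient descent, and average the k estimates."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # k bootstrap subsets: boot_size row indices each, drawn with replacement.
    idx = rng.integers(0, m, size=(k, boot_size))
    A_sub, y_sub = A[idx], y[idx]          # shapes (k, boot_size, n) and (k, boot_size)
    # Step size from the largest Lipschitz constant among the k least-squares terms.
    step = 1.0 / max(np.linalg.norm(Aj, 2) ** 2 for Aj in A_sub)
    X = np.zeros((n, k))                   # column j is the estimate from subset j
    for _ in range(n_iter):
        resid = np.einsum("jln,nj->jl", A_sub, X) - y_sub
        grad = np.einsum("jln,jl->nj", A_sub, resid)
        Z = X - step * grad
        # Row-wise soft thresholding: shrinks each row of Z toward zero as a unit,
        # which pushes all k estimates to share the same support.
        row_norm = np.linalg.norm(Z, axis=1, keepdims=True)
        X = np.maximum(0.0, 1.0 - step * lam / np.maximum(row_norm, 1e-12)) * Z
    return X.mean(axis=1)

# Synthetic noiseless problem (hypothetical): a 2-sparse signal in 20 dimensions.
rng = np.random.default_rng(1)
n, m = 20, 60
x_true = np.zeros(n)
x_true[[3, 11]] = [1.0, -1.5]
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x_true

x_hat = jobs_sketch(A, y, boot_size=40, k=5)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(rel_err)
```

Here boot_size plays the role of L and k the number of estimates from the abstract. Penalizing whole rows of the estimate matrix is what ties the k bootstrap problems together; solving them independently and averaging would correspond to ordinary bagging.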
Forest Fire detection using drones
Anusua Trivedi, Microsoft; Isha Chakraborty, Monta Vista High School
INTRODUCTION: In this talk, we propose deep learning techniques for improved detection of forest fires. We propose a method to synthesize similar training images using Generative Adversarial Networks (GANs) on a BING-scraped image dataset. We apply a pre-trained deep convolutional neural network (DCNN) to these generated and labeled images to improve prediction accuracy. We use an ImageNet pre-trained DCNN and apply fine-tuning to transfer the learned features to these new domain-specific images. We deploy this fine-tuned model in Azure as a web service and use this web service from a Parrot drone to detect a forest fire in real time. Our approach improves prediction accuracy from ~65% to ~88% on domain-specific datasets, compared to state-of-the-art machine learning approaches.

MOTIVATION: Scientists have shown through historical records that the effects of climate warming on natural and human systems are becoming increasingly visible through the increased magnitude of forest fires [1,2,3,4]. We propose an improved approach to predict forest fires preemptively, so that the natural and human damage can be controlled to some extent. We show how a combination of GANs and DCNNs allows a pre-trained DCNN model to be reused in a completely different, disjoint domain. We describe an end-to-end pipeline that generates artificial images using GANs, uses a DCNN to classify and predict forest fires accurately, and gives a Parrot drone easy access to the trained model.

Generative Adversarial Network for image synthesis: We use GANs to generate artificial images from the original dataset, which gives us more training data for deep neural network training. We use two methods: DCGAN [10] and InfoGAN [11]. DCGAN is the most basic, widely used, and more stable method for simple image synthesis, while InfoGAN is a modification of DCGAN that can disentangle the feature representation in order to generate images with required features. Based on some existing implementations [5], we trained both GAN models on our BING-scraped forest-fire dataset to generate more artificial images, which in turn helped improve the classification accuracy.
Transfer Learning & Fine-tuning DCNNs: Current research trends have demonstrated that DCNNs are very effective at automatically analyzing large collections of images and identifying features that can categorize images with minimum error. DCNNs are rarely trained from scratch, as it is relatively uncommon to have a domain-specific dataset of sufficient size. Because modern DCNNs take 2-3 weeks to train across GPUs, the Berkeley Vision and Learning Center (BVLC) has released some final DCNN checkpoints. In this work, we use an ImageNet pre-trained VGGNet [6].
Deep Learning models for Forest-Fire Image Classification: We fine-tune the pre-trained generic DCNN to recognize forest-fire images and improve fire detection accuracy. Our approach is an end-to-end learning strategy, with minimum assumptions about the contents of images. We show that our approach improves prediction accuracy over the results produced by a support vector machine approach.

[1] B.J. Harvey: http://www.pnas.org/content/113/42/11649.full.pdf
[2] Westerling AL et al. (2006) Warming and earlier spring increase western U.S. forest wildfire activity.
[3] Calder WJ et al. (2015) Medieval warming initiated exceptionally large wildfire outbreaks in the Rocky Mountains.
[4] Littell JS et al. (2009) Climate and wildfire area burned in western U.S. ecoprovinces, 1916-2003.
[5] J. Raiman: https://github.com/JonathanRaiman/tensorflow-infogan
[6] K. Simonyan et al.: https://arxiv.org/pdf/1409.1556.pdf
[7] CS231n: Convolutional Neural Networks for Visual Recognition
[8] Classifying plankton with deep neural networks
[9] Yoshua Bengio. Deep Learning of Representations for Unsupervised and Transfer Learning.
[10] Chintala et al.: https://arxiv.org/pdf/1511.06434.pdf
[11] Chen et al.: https://arxiv.org/pdf/1606.03657.pdf