## Neural and Evolutionary Computing

### Deep Learning with Sets and Point Clouds

Comments: under review at ICLR

**Subjects**

:

Machine Learning (stat.ML)

; Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

We study a simple notion of structural invariance that readily suggests a

parameter-sharing scheme in deep neural networks. In particular, we define

structure as a collection of relations, and derive graph convolution and

recurrent neural networks as special cases. We study composition of basic

structures in defining models that are invariant to more complex “product”

structures such as graph of graphs, sets of images or sequence of sets. For

demonstration, our experimental results are focused on the setting where the

discrete structure of interest is a set. We present results on several novel

and non-trivial problems on sets, including semi-supervised learning using

clustering information, point-cloud classification and set outlier detection.

### Generate Models and Model Criticism via Optimized Maximum Mean Discrepancy

Comments: Submitted to ICLR 2017; public comments at this http URL

**Subjects**

:

Machine Learning (stat.ML)

; Artificial Intelligence (cs.AI); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Methodology (stat.ME)

We propose a method to optimize the representation and distinguishability of

samples from two probability distributions, by maximizing the estimated power

of a statistical test based on the maximum mean discrepancy (MMD). This

optimized MMD is applied to the setting of unsupervised learning by generative

adversarial networks (GAN), in which a model attempts to generate realistic

samples, and a discriminator attempts to tell these apart from data samples. In

this context, the MMD may be used in two roles: first, as a discriminator,

either directly on the samples, or on features of the samples. Second, the MMD

can be used to evaluate the performance of a generative model, by testing the

model’s samples against a reference data set. In the latter role, the optimized

MMD is particularly helpful, as it gives an interpretable indication of how the

model and data distributions differ, even in cases where individual model

samples are not easily distinguished either by eye or by classifier.

Comments: 4 pages, 13 figures

**Subjects**

:

Emerging Technologies (cs.ET)

; Neural and Evolutionary Computing (cs.NE)

We experimentally demonstrate classification of 4×4 binary images into 4

classes, using a 3-layer mixed-signal neuromorphic network (“MLP perceptron”),

based on two passive 20×20 memristive crossbar arrays, board-integrated with

discrete CMOS components. The network features 10 hidden-layer and 4

output-layer analog CMOS neurons and 428 metal-oxide memristors, i.e. is almost

an order of magnitude more complex than any previously reported functional

memristor circuit. Moreover, the inference operation of this classifier is

performed entirely in the integrated hardware. To deal with larger crossbar

arrays, we have developed a semi-automatic approach to their forming and

testing, and compared several memristor training schemes for coping with

imperfect behavior of these devices, as well as with variability of analog CMOS

neurons. The effectiveness of the proposed schemes for defect and variation

tolerance was verified experimentally using the implemented network and,

additionally, by modeling the operation of a larger network, with 300

hidden-layer neurons, on the MNIST benchmark. Finally, we propose a simple

modification of the implemented memristor-based vector-by-matrix multiplier to

allow its operation in a wider temperature range.

### Attending to Characters in Neural Sequence Labeling Models

Comments: Proceedings of COLING 2016

**Subjects**

:

Computation and Language (cs.CL)

; Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Sequence labeling architectures use word embeddings for capturing similarity,

but suffer when handling previously unseen or rare words. We investigate

character-level extensions to such models and propose a novel architecture for

combining alternative word representations. By using an attention mechanism,

the model is able to dynamically decide how much information to use from a

word- or character-level component. We evaluated different architectures on a

range of sequence labeling datasets, and character-level extensions were found

to improve performance on every benchmark. In addition, the proposed

attention-based architecture delivered the best results even with a smaller

number of trainable parameters.

### Identity Matters in Deep Learning

Moritz Hardt , Tengyu Ma **Subjects** : Learning (cs.LG) ; Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

An emerging design principle in deep learning is that each layer of a deep

artificial neural network should be able to easily express the identity

transformation. This idea not only motivated various normalization techniques,

such as emph{batch normalization}, but was also key to the immense success of

emph{residual networks}.

In this work, we put the principle of emph{identity parameterization} on a

more solid theoretical footing alongside further empirical progress. We first

give a strikingly simple proof that arbitrarily deep linear residual networks

have no spurious local optima. The same result for linear feed-forward networks

in their standard parameterization is substantially more delicate. Second, we

show that residual networks with ReLu activations have universal finite-sample

expressivity in the sense that the network can represent any function of its

sample provided that the model has more parameters than the sample size.

Directly inspired by our theory, we experiment with a radically simple

residual architecture consisting of only residual convolutional layers and ReLu

activations, but no batch normalization, dropout, or max pool. Our model

improves significantly on previous all-convolutional networks on the CIFAR10,

CIFAR100, and ImageNet classification benchmarks.

## Computer Vision and Pattern Recognition

### 3-D Convolutional Neural Networks for Glioblastoma Segmentation

Darvin Yi , Mu Zhou , Zhao Chen , Olivier Gevaert **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

Convolutional Neural Networks (CNN) have emerged as powerful tools for

learning discriminative image features. In this paper, we propose a framework

of 3-D fully CNN models for Glioblastoma segmentation from multi-modality MRI

data. By generalizing CNN models to true 3-D convolutions in learning 3-D tumor

MRI data, the proposed approach utilizes a unique network architecture to

decouple image pixels. Specifically, we design a convolutional layer with

pre-defined Difference- of-Gaussian (DoG) filters to perform true 3-D

convolution incorporating local neighborhood information at each pixel. We then

use three trained convolutional layers that act to decouple voxels from the

initial 3-D convolution. The proposed framework allows identification of

high-level tumor structures on MRI. We evaluate segmentation performance on the

BRATS segmentation dataset with 274 tumor samples. Extensive experimental

results demonstrate encouraging performance of the proposed approach comparing

to the state-of-the-art methods. Our data-driven approach achieves a median

Dice score accuracy of 89% in whole tumor glioblastoma segmentation, revealing

a generalized low-bias possibility to learn from medium-size MRI datasets.

Fast Task-Specific Target Detection via Graph Based Constraints Representation and Checking

Went Luan , Yezhou Yang , Cornelia Fermuller , John S. Baras **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

In this work, we present a fast target detection framework for real-world

robotics applications. Considering that an intelligent agent attends to a

task-specific object target during execution, our goal is to detect the object

efficiently. We propose the concept of early recognition, which influences the

candidate proposal process to achieve fast and reliable detection performance.

To check the target constraints efficiently, we put forward a novel policy to

generate a sub-optimal checking order, and prove that it has bounded time cost

compared to the optimal checking sequence, which is not achievable in

polynomial time. Experiments on two different scenarios: 1) rigid object and 2)

non-rigid body part detection validate our pipeline. To show that our method is

widely applicable, we further present a human-robot interaction system based on

our non-rigid body part detection.

### Can fully convolutional networks perform well for general image restoration problems?

Subhajit Chaudhury , Hiya Roy **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

We present a fully convolutional network(FCN) based approach for color image

restoration. Fully convolutional networks have recently shown remarkable

performance for high level vision problems like semantic segmentation. In this

paper, we investigate if fully convolutional networks can show promising

performance for low level problems like image restoration as well. We propose a

FCN model, that learns a direct end-to-end mapping between the corrupted images

as input and the desired clean images as output. Experimental results show that

our FCN model outperforms traditional sparse coding based methods and

demonstrates competitive performance compared to the state-of-the-art methods

for image denoising. We further show that our proposed model can solve the

difficult problem of blind image inpainting and can produce reconstructed

images of impressive visual quality.

### Automatic discovery of discriminative parts as a quadratic assignment problem

Ronan Sicre , Julien Rabin , Yannis Avrithis , Teddy Furon , Frederic Jurie **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

Part-based image classification consists in representing categories by small

sets of discriminative parts upon which a representation of the images is

built. This paper addresses the question of how to automatically learn such

parts from a set of labeled training images. The training of parts is cast as a

quadratic assignment problem in which optimal correspondences between image

regions and parts are automatically learned. The paper analyses different

assignment strategies and thoroughly evaluates them on two public datasets:

Willow actions and MIT 67 scenes. State-of-the art results are obtained on

these datasets.

### Joint Graph Decomposition and Node Labeling by Local Search

Evgeny Levinkov , Siyu Tang , Eldar Insafutdinov , Bjoern Andres **Subjects** : Computer Vision and Pattern Recognition (cs.CV) ; Discrete Mathematics (cs.DM)

We state a combinatorial optimization problem whose feasible solutions define

both a decomposition and a node labeling of a given graph. This problem offers

a common mathematical abstraction of seemingly unrelated computer vision tasks,

including instance-separating semantic segmentation, articulated human body

pose estimation and multiple object tracking. Conceptually, the problem we

propose generalizes the unconstrained integer quadratic program and the minimum

cost lifted multicut problem, both of which are NP-hard. In order to find

feasible solutions efficiently, we define a local search algorithm that

converges monotonously to a local optimum, offering a feasible solution at any

time. To demonstrate the effectiveness of this algorithm in solving computer

vision tasks, we report running times and competitive solutions for two

above-mentioned applications.

### Selfie Detection by Synergy-Constraint Based Convolutional Neural Network

Comments: 8 Pages, Accepted for Publication at IEEE SITIS 2016

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

Categorisation of huge amount of data on the multimedia platform is a crucial

task. In this work, we propose a novel approach to address the subtle problem

of selfie detection for image database segregation on the web, given rapid rise

in number of selfies clicked. A Convolutional Neural Network (CNN) is modeled

to learn a synergy feature in the common subspace of head and shoulder

orientation, derived from Local Binary Pattern (LBP) and Histogram of Oriented

Gradients (HOG) features respectively. This synergy was captured by projecting

the aforementioned features using Canonical Correlation Analysis (CCA). We show

that the resulting network’s convolutional activations in the neighbourhood of

spatial keypoints captured by SIFT are discriminative for selfie-detection. In

general, proposed approach aids in capturing intricacies present in the image

data and has the potential for usage in other subtle image analysis scenarios

apart from just selfie detection. We investigate and analyse the performance of

popular CNN architectures (GoogleNet, AlexNet), used for other image

classification tasks, when subjected to the task of detecting the selfies on

the multimedia platform. The results of the proposed approach are compared with

these popular architectures on a dataset of ninety thousand images comprising

of roughly equal number of selfies and non-selfies. Experimental results on

this dataset shows the effectiveness of the proposed approach.

### Herding Generalizes Diverse M -Best Solutions

Comments: 10 pages, 2 algorithms, 2 figures

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

We show that the algorithm to extract diverse M -solutions from a Conditional

Random Field (called divMbest [1]) takes exactly the form of a Herding

procedure [2], i.e. a deterministic dynamical system that produces a sequence

of hypotheses that respect a set of observed moment constraints. This

generalization enables us to invoke properties of Herding that show that

divMbest enforces implausible constraints which may yield wrong assumptions for

some problem settings. Our experiments in semantic segmentation demonstrate

that seeing divMbest as an instance of Herding leads to better alternatives for

the implausible constraints of divMbest.

### A DNN Framework For Text Image Rectification From Planar Transformations

Comments: 9 pages, 10 figures

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

In this paper, a novel neural network architecture is proposed attempting to

rectify text images with mild assumptions. A new dataset of text images is

collected to verify our model and open to public. We explored the capability of

deep neural network in learning geometric transformation and found the model

could segment the text image without explicit supervised segmentation

information. Experiments show the architecture proposed can restore planar

transformations with wonderful robustness and effectiveness.

### Baseline CNN structure analysis for facial expression recognition

Comments: 6 pages, RO-MAN2016 Conference

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

We present a baseline convolutional neural network (CNN) structure and image

preprocessing methodology to improve facial expression recognition algorithm

using CNN. To analyze the most efficient network structure, we investigated

four network structures that are known to show good performance in facial

expression recognition. Moreover, we also investigated the effect of input

image preprocessing methods. Five types of data input (raw, histogram

equalization, isotropic smoothing, diffusion-based normalization, difference of

Gaussian) were tested, and the accuracy was compared. We trained 20 different

CNN models (4 networks x 5 data input types) and verified the performance of

each network with test images from five different databases. The experiment

result showed that a three-layer structure consisting of a simple convolutional

and a max pooling layer with histogram equalization image input was the most

efficient. We describe the detailed training procedure and analyze the result

of the test accuracy based on considerable observation.

### Multi-Shot Mining Semantic Part Concepts in CNNs

Comments: accepted for presentation at the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

This paper proposes a new learning strategy that incrementally embeds new

object-part concepts into a pre-trained convolutional neural network (CNN), in

order to 1) explore explicit semantics for the CNN units and 2) gradually

transfer the pre-trained CNN into a “white-box” model for hierarchical object

understanding. Given part annotations on a very small number (e.g. 3–12) of

objects, our method mines certain units from the pre-trained CNN and associate

them with different part concepts. We use a four-layer And-Or graph to organize

the CNN units, which clarifies their internal semantic hierarchy. Our method is

guided by a small number of part annotations, and it achieves superior

part-localization performance (about 28%–107% improvement in part center

prediction).

### Convolutional Regression for Visual Tracking

Kai Chen , Wenbing Tao **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

Recently, discriminatively learned correlation filters (DCF) has drawn much

attention in visual object tracking community. The success of DCF is

potentially attributed to the fact that a large amount of samples are utilized

to train the ridge regression model and predict the location of object. To

solve the regression problem in an efficient way, these samples are all

generated by circularly shifting from a search patch. However, these synthetic

samples also induce some negative effects which decrease the robustness of DCF

based trackers.

In this paper, we propose a Convolutional Regression framework for visual

tracking (CRT). Instead of learning the linear regression model in a closed

form, we try to solve the regression problem by optimizing a one-channel-output

convolution layer with Gradient Descent (GD). In particular, the receptive

field size of the convolution layer is set to the size of object. Contrary to

DCF, it is possible to incorporate all “real” samples clipped from the whole

image. A critical issue of the GD approach is that most of the convolutional

samples are negative and the contribution of positive samples will be

submerged. To address this problem, we propose a novel Automatic Hard Negative

Mining method to eliminate easy negatives and enhance positives. Extensive

experiments are conducted on a widely-used benchmark with 100 sequences. The

results show that the proposed algorithm achieves outstanding performance and

outperforms almost all the existing DCF based algorithms.

### Semi-Dense 3D Semantic Mapping from Monocular SLAM

Xuanpeng Li , Rachid Belaroussi **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

The bundle of geometry and appearance in computer vision has proven to be a

promising solution for robots across a wide variety of applications. Stereo

cameras and RGB-D sensors are widely used to realise fast 3D reconstruction and

trajectory tracking in a dense way. However, they lack flexibility of seamless

switch between different scaled environments, i.e., indoor and outdoor scenes.

In addition, semantic information are still hard to acquire in a 3D mapping. We

address this challenge by combining the state-of-art deep learning method and

semi-dense Simultaneous Localisation and Mapping (SLAM) based on video stream

from a monocular camera. In our approach, 2D semantic information are

transferred to 3D mapping via correspondence between connective Keyframes with

spatial consistency. There is no need to obtain a semantic segmentation for

each frame in a sequence, so that it could achieve a reasonable computation

time. We evaluate our method on indoor/outdoor datasets and lead to an

improvement in the 2D semantic labelling over baseline single frame

predictions.

### Hand Gesture Recognition for Contactless Device Control in Operating Rooms

Ebrahim Nasr-Esfahani , Nader Karimi , S.M. Reza Soroushmehr , M. Hossein Jafari , M. Amin Khorsandi , Shadrokh Samavi , Kayvan Najarian **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

Hand gesture is one of the most important means of touchless communication

between human and machines. There is a great interest for commanding electronic

equipment in surgery rooms by hand gesture for reducing the time of surgery and

the potential for infection. There are challenges in implementation of a hand

gesture recognition system. It has to fulfill requirements such as high

accuracy and fast response. In this paper we introduce a system of hand gesture

recognition based on a deep learning approach. Deep learning is known as an

accurate detection model, but its high complexity prevents it from being

fabricated as an embedded system. To cope with this problem, we applied some

changes in the structure of our work to achieve low complexity. As a result,

the proposed method could be implemented on a naive embedded system. Our

experiments show that the proposed system results in higher accuracy while

having less complexity in comparison with the existing comparable methods.

### Automated Inference on Criminality using Face Images

Xiaolin Wu , Xi Zhang **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

We study, for the first time, automated inference on criminality based solely

on still face images. Via supervised machine learning, we build four

classifiers (logistic regression, KNN, SVM, CNN) using facial images of 1856

real persons controlled for race, gender, age and facial expressions, nearly

half of whom were convicted criminals, for discriminating between criminals and

non-criminals. All four classifiers perform consistently well and produce

evidence for the validity of automated face-induced inference on criminality,

despite the historical controversy surrounding the topic. Also, we find some

discriminating structural features for predicting criminality, such as lip

curvature, eye inner corner distance, and the so-called nose-mouth angle. Above

all, the most important discovery of this research is that criminal and

non-criminal face images populate two quite distinctive manifolds. The

variation among criminal faces is significantly greater than that of the

non-criminal faces. The two manifolds consisting of criminal and non-criminal

faces appear to be concentric, with the non-criminal manifold lying in the

kernel with a smaller span, exhibiting a law of normality for faces of

non-criminals. In other words, the faces of general law-biding public have a

greater degree of resemblance compared with the faces of criminals, or

criminals have a higher degree of dissimilarity in facial appearance than

normal people.

### Multi-class Generative Adversarial Networks with the L2 Loss Function

Xudong Mao , Qing Li , Haoran Xie , Raymond Y.K. Lau , Zhen Wang **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

Generative adversarial networks (GANs) have achieved huge success in

unsupervised learning. Most of GANs treat the discriminator as a classifier

with the binary sigmoid cross entropy loss function. However, we find that the

sigmoid cross entropy loss function will sometimes lead to the saturation

problem in GANs learning. In this work, we propose to adopt the L2 loss

function for the discriminator. The properties of the L2 loss function can

improve the stabilization of GANs learning. With the usage of the L2 loss

function, we propose the multi-class generative adversarial networks for the

purpose of image generation with multiple classes. We evaluate the multi-class

GANs on a handwritten Chinese characters dataset with 3740 classes. The

experiments demonstrate that the multi-class GANs can generate elegant images

on datasets with a large number of classes. Comparison experiments between the

L2 loss function and the sigmoid cross entropy loss function are also conducted

and the results demonstrate the stabilization of the L2 loss function.

### Leveraging Video Descriptions to Learn Video Question Answering

Comments: 7 pages, 5 figures. Accepted to AAAI 2017

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

; Artificial Intelligence (cs.AI); Multimedia (cs.MM)

We propose a scalable approach to learn video-based question answering (QA):

answer a “free-form natural language question” about a video content. Our

approach automatically harvests a large number of videos and descriptions

freely available online. Then, a large number of candidate QA pairs are

automatically generated from descriptions rather than manually annotated. Next,

we use these candidate QA pairs to train a number of video-based QA methods

extended fromMN (Sukhbaatar et al. 2015), VQA (Antol et al. 2015), SA (Yao et

al. 2015), SS (Venugopalan et al. 2015). In order to handle non-perfect

candidate QA pairs, we propose a self-paced learning procedure to iteratively

identify them and mitigate their effects in training. Finally, we evaluate

performance on manually generated video-based QA pairs. The results show that

our self-paced learning procedure is effective, and the extended SS model

outperforms various baselines.

### Optimized clothes segmentation to boost gender classification in unconstrained scenarios

D. Freire-Obregón , M. Castrillón-Santana , J. Lorenzo-Navarro **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

Several applications require demographic information of ordinary people in

unconstrained scenarios. This is not a trivial task due to significant human

appearance variations. In this work, we introduce trixels for clustering image

regions, enumerating their advantages compared to superpixels. The classical

GrabCut algorithm is later modified to segment trixels instead of pixels in an

unsupervised context. Combining with face detection lead us to a clothes

segmentation approach close to real time. The study uses the challenging Pascal

VOC dataset for segmentation evaluation experiments. A final experiment

analyzes the fusion of clothes features with state-of-the-art gender

classifiers in ClothesDB, revealing a significant performance improvement in

gender classification.

Dapeng Luo , Zhipeng Zeng , Longsheng Wei , Yongwen Liu , Chen Luo , Jun Chen , Nong Sang **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

In this paper, our proposed framework takes a remarkably different direction

to resolve multi-view detection problem in a bottom-up fashion. Firstly, a

state of the art scene-specific objector is obtained from a fully autonomous

learning process triggered by marking several bounding boxes around the object

in the first video frame via a mouse, where the human labeled training data or

a generic detector are not needed. Secondly, this learning process is

conveniently replicated many times in different surveillance scenes and result

in a particular detector under various camera viewpoints. Thus, the proposed

framework can be employed in multi-view object detection application with

unsupervised manner. Obviously, the initial scene-specific detector, initialed

by several bounding boxes, exhibits poor detection performance and difficult to

improve by traditional online learning algorithm. Consequently, in our

self-learning process, we propose to partition detection responses space by

online hybrid classifiers and assign each partition individual classifier that

achieves the high classification accuracy progressively. A novel online gradual

learning algorithm is proposed to train the hybrid classifiers automatically

and focus online learning on the hard samples, the most informative samples

lying around the decision boundary. The output is a hybrid classifier based

scene-specific detector which achieves decent performance under different

viewing angles. Experimental results on several video datasets show that our

approach achieves comparable performance to robust supervised methods and

outperforms the state of the art online learning methods in varying imaging

conditions.

### When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing Features

Kuan-Ting Chen , Jiebo Luo **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

With the prevalence of e-commence websites and the ease of online shopping,

consumers are embracing huge amounts of various options in products.

Undeniably, shopping is one of the most essential activities in our society and

studying consumer’s shopping behavior is important for the industry as well as

sociology and psychology. Not surprisingly, one of the most popular e-commerce

categories is clothing business. There arises the needs for analysis of popular

and attractive clothing features which could further boost many emerging

applications, such as clothing recommendation and advertising. In this work, we

design a novel system that consists of three major components: 1) exploring and

organizing a large-scale clothing dataset from a online shopping website, 2)

pruning and extracting images of best-selling products in clothing item data

and user transaction history, and 3) utilizing a machine learning based

approach to discovering clothing attributes as the representative and

discriminative characteristics of popular clothing style elements. Through the

experiments over a large-scale online clothing dataset, we demonstrate the

effectiveness of our proposed system, and obtain useful insights on clothing

consumption trends and profitable clothing features.

### Effective sparse representation of X-Ray medical images

Comments: Routines for implementing the approach are available on this http URL

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

Effective sparse representation of X-Ray medical images within the context of

data reduction is considered. The proposed framework is shown to render an

enormous reduction in the cardinality of the data set required to represent

this class of images at very good quality. The particularity of the approach is

that it can be implemented at very competitive processing time and low memory

requirements

Zero-resource Machine Translation by Multimodal Encoder-decoder Network with Multimedia Pivot

Hideki Nakayama , Noriki Nishida **Subjects** : Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

We propose an approach to build a neural machine translation system with no

supervised resources (i.e., no parallel corpora) using multimodal embedded

representation over texts and images. Based on the assumption that text

documents are often likely to be described with other multimedia information

(e.g., images) somewhat related to the content, we try to indirectly estimate

the relevance between two languages. Using multimedia as the “pivot”, we

project all modalities into one common hidden space where samples belonging to

similar semantic concepts should come close to each other, whatever the

observed space of each sample is. This modality-agnostic representation is the

key to bridging the gap between different modalities. Putting a decoder on top

of it, our network can flexibly draw the outputs from any input modality.

Notably, in the testing phase, we need only source language texts as the input

for translation. In experiments, we tested our method on two benchmarks to show

that it can achieve reasonable translation performance. We compared and

investigated several possible implementations and found that an end-to-end

model that simultaneously optimized both rank loss in multimodal encoders and

cross-entropy loss in decoders performed the best.

### (CAD)(^2)RL: Real Single-Image Flight without a Single Real Image

Comments: 11 pages

**Subjects**

:

Learning (cs.LG)

; Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

We propose (CAD)(^2)RL, a flight controller for Collision Avoidance via Deep

Reinforcement Learning that can be used to perform collision-free flight in the

real world although it is trained entirely in a 3D CAD model simulator. Our

method uses only single RGB images from a monocular camera mounted on the robot

as the input and is specialized for indoor hallway following and obstacle

avoidance. In contrast to most indoor navigation techniques that aim to

directly reconstruct the 3D geometry of the environment, our approach directly

predicts the probability of collision given the current monocular image and a

candidate action. To obtain accurate predictions, we develop a deep

reinforcement learning algorithm for learning indoor navigation, which uses the

actual performance of the current policy to construct accurate supervision. The

collision prediction model is represented by a deep convolutional neural

network that directly processes raw image inputs. Our collision avoidance

system is entirely trained in simulation and thus addresses the high sample

complexity of deep reinforcement learning and avoids the dangers of

trial-and-error learning in the real world. By highly randomizing the rendering

settings for our simulated training set, we show that we can train a collision

predictor that generalizes to new environments with substantially different

appearance from the training scenarios. Finally, we evaluate our method in the

real world by controlling a real quadrotor flying through real hallways. We

demonstrate that our method can perform real-world collision avoidance and

hallway following after being trained exclusively on synthetic images, without

ever having seen a single real image at the training time. For supplementary

video see: this https URL

## Artificial Intelligence

### Feature Engineering and Ensemble Modeling for Paper Acceptance Rank Prediction

Comments: 2nd place winner report of KDD Cup 2016. More details can be found at this https URL

**Subjects**

:

Artificial Intelligence (cs.AI)

Measuring research impact and ranking academic achievement are important and

challenging problems. Having an objective picture of research institution is

particularly valuable for students, parents and funding agencies, and also

attracts attention from government and industry. KDD Cup 2016 proposes the

paper acceptance rank prediction task, in which the participants are asked to

rank the importance of institutions based on predicting how many of their

papers will be accepted at the 8 top conferences in computer science. In our

work, we adopt a three-step feature engineering method, including basic

features definition, finding similar conferences to enhance the feature set,

and dimension reduction using PCA. We propose three ranking models and the

ensemble methods for combining such models. Our experiment verifies the

effectiveness of our approach. In KDD Cup 2016, we achieved the overall rank of

the 2nd place.

### Learning for Expertise Matching with Declination Prediction

Yujie Qian , Jie Tang **Subjects** : Artificial Intelligence (cs.AI)

We study the problem of finding appropriate experts who are able to complete

timely reviews and would not say “no” to the invitation. The problem is a

central issue in many question-and-answer systems, but has received little

research attention. Different from most existing studies that focus on

expertise matching, we want to further predict the expert’s response: given a

question, how can we find the expert who is able to provide a quality review

and will agree to do it. We formalize the problem as a ranking problem. We

first present an embedding-based question-to-expert distance metric for

expertise matching and propose a ranking factor graph (RankFG) model to predict

expert response. For online evaluation, we developed a Chrome Extension for

reviewer recommendation and deployed it in the Google Chrome Web Store, and

then collected the reviewers’ feedback. We also used the review bidding of a CS

conference for evaluation. In the experiments, the proposed method demonstrates

its superiority (+6.6-21.2% by MAP) over several state-of-the-art algorithms.

Comments: 7 pages

**Subjects**

:

Artificial Intelligence (cs.AI)

This paper proposes a general framework to combine context and commonsense

knowledge for solving the Winograd Schema (WS) and Pronoun Disambiguation

Problems (PDP). In the proposed framework, commonsense knowledge bases (e.g.

cause-effect word pairs) are quantized as knowledge constraints. The

constraints guide us to learn knowledge enhanced embeddings (KEE) from large

text corpus. Based on the pre-trained KEE models, this paper proposes two

methods to solve the WS and PDP problems. The first method is an unsupervised

method, which represents all the pronouns and candidate mentions in continuous

vector spaces based on their contexts and calculates the semantic similarities

between all the possible word pairs. The pronoun disambiguation procedure could

then be implemented by comparing the semantic similarities between the pronoun

(to be resolved) and all the candidate mentions. The second method is a

supervised method, which extracts features for all the pronouns and candidate

mentions and solves the WS problems by training a typical mention pair

classification model. Similar to the first method, the features used in the

second method are also extracted based on the KEE models. Experiments conducted

on the available PDP and WS test sets show that, these two methods both achieve

consistent improvements over the baseline systems. The best performance reaches

62/% in accuracy on the PDP test set of the first Winograd Schema Challenge.

### Entropic Causal Inference

Comments: To appear in AAAI 2017

**Subjects**

:

Artificial Intelligence (cs.AI)

; Information Theory (cs.IT); Machine Learning (stat.ML)

We consider the problem of identifying the causal direction between two

discrete random variables using observational data. Unlike previous work, we

keep the most general functional model but make an assumption on the unobserved

exogenous variable: Inspired by Occam’s razor, we assume that the exogenous

variable is simple in the true causal direction. We quantify simplicity using

R’enyi entropy. Our main result is that, under natural assumptions, if the

exogenous variable has low (H_0) entropy (cardinality) in the true direction,

it must have high (H_0) entropy in the wrong direction. We establish several

algorithmic hardness results about estimating the minimum entropy exogenous

variable. We show that the problem of finding the exogenous variable with

minimum entropy is equivalent to the problem of finding minimum joint entropy

given (n) marginal distributions, also known as minimum entropy coupling

problem. We propose an efficient greedy algorithm for the minimum entropy

coupling problem, that for (n=2) provably finds a local optimum. This gives a

greedy algorithm for finding the exogenous variable with minimum (H_1) (Shannon

Entropy). Our greedy entropy-based causal inference algorithm has similar

performance to the state of the art additive noise models in real datasets. One

advantage of our approach is that we make no use of the values of random

variables but only their distributions. Our method can therefore be used for

causal inference for both ordinal and also categorical data, unlike additive

noise models.

### A Review on Algorithms for Constraint-based Causal Discovery

Kui Yu , Jiuyong Li , Lin Liu **Subjects** : Artificial Intelligence (cs.AI)

Causal discovery studies the problem of mining causal relationships between

variables from data, which is of primary interest in science. During the past

decades, significant amount of progresses have been made toward this

fundamental data mining paradigm. Recent years, as the availability of abundant

large-sized and complex observational data, the constrain-based approaches have

gradually attracted a lot of interest and have been widely applied to many

diverse real-world problems due to the fast running speed and easy generalizing

to the problem of causal insufficiency. In this paper, we aim to review the

constraint-based causal discovery algorithms. Firstly, we discuss the learning

paradigm of the constraint-based approaches. Secondly and primarily, the

state-of-the-art constraint-based casual inference algorithms are surveyed with

the detailed analysis. Thirdly, several related open-source software packages

and benchmark data repositories are briefly summarized. As a conclusion, some

open problems in constraint-based causal discovery are outlined for future

research.

### Multi-lingual Knowledge Graph Embeddings for Cross-lingual Knowledge Alignment

Muhao Chen , Yingtao Tian , Mohan Yang , Carlo Zaniolo **Subjects** : Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)

Many recent works have demonstrated the benefits of knowledge graph

embeddings in completing monolingual knowledge graphs. Inasmuch as related

knowledge bases are built in several different languages, achieving

cross-lingual knowledge alignment will help people in constructing a coherent

knowledge base, and assist machines in dealing with different expressions of

entity relationships across diverse human languages. Unfortunately, achieving

this highly desirable crosslingual alignment by human labor is very costly and

errorprone. Thus, we propose MTransE, a translation-based model for

multilingual knowledge graph embeddings, to provide a simple and automated

solution. By encoding entities and relations of each language in a separated

embedding space, MTransE provides transitions for each embedding vector to its

cross-lingual counterparts in other spaces, while preserving the

functionalities of monolingual embeddings. We deploy three different techniques

to represent cross-lingual transitions, namely axis calibration, translation

vectors, and linear transformations, and derive five variants for MTransE using

different loss functions. Our models can be trained on partially aligned

graphs, where just a small portion of triples are aligned with their

cross-lingual counterparts. The experiments on cross-lingual entity matching

and triple-wise alignment verification show promising results, with some

variants consistently outperforming others on different tasks. We also explore

how MTransE preserves the key properties of its monolingual counterpart TransE.

### Reinforcement Learning of Contextual MDPs using Spectral Methods

Kamyar Azizzadenesheli , Alessandro Lazaric , Animashree Anandkumar **Subjects** : Artificial Intelligence (cs.AI) ; Learning (cs.LG); Machine Learning (stat.ML)

We propose a new reinforcement learning (RL) algorithm for contextual Markov

decision processes (CMDP) using spectral methods. CMDPs are structured MDPs

where the dynamics and rewards depend on a smaller number of hidden states or

contexts. If the mapping between the hidden and observed states is known a

priori, then standard RL algorithms such as UCRL are guaranteed to attain low

regret. Is it possible to achieve regret of the same order even when the

mapping is unknown? We provide an affirmative answer in this paper. We exploit

spectral methods to learn the mapping from hidden to observed states with

guaranteed confidence bounds, and incorporate it into the UCRL-based framework

to obtain order-optimal regret.

### Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

Melvin Johnson , Mike Schuster , Quoc V. Le , Maxim Krikun , Yonghui Wu , Zhifeng Chen , Nikhil Thorat , Fernanda Viégas , Martin Wattenberg , Greg Corrado , Macduff Hughes , Jeffrey Dean **Subjects** : Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)

We propose a simple, elegant solution to use a single Neural Machine

Translation (NMT) model to translate between multiple languages. Our solution

requires no change in the model architecture from our base system but instead

introduces an artificial token at the beginning of the input sentence to

specify the required target language. The rest of the model, which includes

encoder, decoder and attention, remains unchanged and is shared across all

languages. Using a shared wordpiece vocabulary, our approach enables

Multilingual NMT using a single model without any increase in parameters, which

is significantly simpler than previous proposals for Multilingual NMT. Our

method often improves the translation quality of all involved language pairs,

even while keeping the total number of model parameters constant. On the WMT’14

benchmarks, a single multilingual model achieves comparable performance for

English(

ightarrow)French and surpasses state-of-the-art results for

English(

ightarrow)German. Similarly, a single multilingual model surpasses

state-of-the-art results for French(

ightarrow)English and

German(

ightarrow)English on WMT’14 and WMT’15 benchmarks respectively. On

production corpora, multilingual models of up to twelve language pairs allow

for better translation of many individual pairs. In addition to improving the

translation quality of language pairs that the model was trained with, our

models can also learn to perform implicit bridging between language pairs never

seen explicitly during training, showing that transfer learning and zero-shot

translation is possible for neural translation. Finally, we show analyses that

hints at a universal interlingua representation in our models and show some

interesting examples when mixing languages.

### Learning the best algorithm for max-cut, clustering, and other partitioning problems

Maria-Florina Balcan , Vaishnavh Nagarajan , Ellen Vitercik , Colin White **Subjects** : Data Structures and Algorithms (cs.DS) ; Artificial Intelligence (cs.AI); Learning (cs.LG)

Many data analysis problems are NP-hard, a reality that has motivated

researchers to develop a wealth of approximation algorithms and heuristics over

the past few decades. Max-cut, clustering, and many other widely applicable

partitioning problems fall into this camp. We focus on two common classes of

algorithms that are used for clustering and partitioning problems, including

SDP rounding algorithms and agglomerative clustering algorithms with dynamic

programming. The best algorithm to use typically depends on the specific

application or distribution over instances. A worst-case analysis is often used

to compare algorithms, but worst-case instances may not be present in the

particular problem domain, and thus may be misleading when determining which

algorithm to apply. Therefore, it is necessary to develop optimization methods

which return the algorithms and parameters best suited for typical inputs from

the application at hand. We address this problem for integer quadratic

programming and clustering within a learning-theoretic framework where the

learning algorithm is given samples from an application-specific distribution

over problem instances and learns an algorithm which performs well over the

distribution. We provide sample complexity guarantees and computationally

efficient algorithms which optimize an algorithm family’s parameters to best

suit typical inputs from a particular application. We analyze these algorithms

using the learning-theoretic notion of pseudo-dimension. Along with upper

bounds, we provide the first pseudo-dimension lower bounds in this line of

work, which require an involved analysis of each algorithm family’s performance

on carefully constructed problem instances. Our bounds are tight and therefore

nail down the intrinsic complexity of the algorithm classes we analyze, which

are significantly more complex than classes commonly used in learning theory.

Zero-resource Machine Translation by Multimodal Encoder-decoder Network with Multimedia Pivot

Hideki Nakayama , Noriki Nishida **Subjects** : Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

We propose an approach to build a neural machine translation system with no

supervised resources (i.e., no parallel corpora) using multimodal embedded

representation over texts and images. Based on the assumption that text

documents are often likely to be described with other multimedia information

(e.g., images) somewhat related to the content, we try to indirectly estimate

the relevance between two languages. Using multimedia as the “pivot”, we

project all modalities into one common hidden space where samples belonging to

similar semantic concepts should come close to each other, whatever the

observed space of each sample is. This modality-agnostic representation is the

key to bridging the gap between different modalities. Putting a decoder on top

of it, our network can flexibly draw the outputs from any input modality.

Notably, in the testing phase, we need only source language texts as the input

for translation. In experiments, we tested our method on two benchmarks to show

that it can achieve reasonable translation performance. We compared and

investigated several possible implementations and found that an end-to-end

model that simultaneously optimized both rank loss in multimodal encoders and

cross-entropy loss in decoders performed the best.

### Generate Models and Model Criticism via Optimized Maximum Mean Discrepancy

Comments: Submitted to ICLR 2017; public comments at this http URL

**Subjects**

:

Machine Learning (stat.ML)

; Artificial Intelligence (cs.AI); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Methodology (stat.ME)

We propose a method to optimize the representation and distinguishability of

samples from two probability distributions, by maximizing the estimated power

of a statistical test based on the maximum mean discrepancy (MMD). This

optimized MMD is applied to the setting of unsupervised learning by generative

adversarial networks (GAN), in which a model attempts to generate realistic

samples, and a discriminator attempts to tell these apart from data samples. In

this context, the MMD may be used in two roles: first, as a discriminator,

either directly on the samples, or on features of the samples. Second, the MMD

can be used to evaluate the performance of a generative model, by testing the

model’s samples against a reference data set. In the latter role, the optimized

MMD is particularly helpful, as it gives an interpretable indication of how the

model and data distributions differ, even in cases where individual model

samples are not easily distinguished either by eye or by classifier.

### Learning to Gather Information via Imitation

Sanjiban Choudhury , Ashish Kapoor , Gireeja Ranade , Debadeepta Dey **Subjects** : Robotics (cs.RO) ; Artificial Intelligence (cs.AI)

The budgeted information gathering problem – where a robot with a fixed fuel

budget is required to maximize the amount of information gathered from the

world – appears in practice across a wide range of applications in autonomous

exploration and inspection with mobile robots. Although there is an extensive

amount of prior work investigating effective approximations of the problem,

these methods do not address the fact that their performance is heavily

dependent on distribution of objects in the world. In this paper, we attempt to

address this issue by proposing a novel data-driven imitation learning

framework.

We present an efficient algorithm, EXPLORE, that trains a policy on the

target distribution to imitate a clairvoyant oracle – an oracle that has full

information about the world and computes non-myopic solutions to maximize

information gathered. We validate the approach on a spectrum of results on a

number of 2D and 3D exploration problems that demonstrates the ability of

EXPLORE to adapt to different object distributions. Additionally, our analysis

provides theoretical insight into the behavior of EXPLORE. Our approach paves

the way forward for efficiently applying data-driven methods to the domain of

information gathering.

### Recognizing and Eliciting Weakly Single Crossing Profiles on Trees

Palash Dey **Subjects** : Multiagent Systems (cs.MA) ; Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)

The domain of single crossing preference profiles is a widely studied domain

in social choice theory. It has been generalized to the domain of single

crossing preference profiles with respect to trees which inherits many

desirable properties from the single crossing domain, for example, transitivity

of majority relation, existence of polynomial time algorithms for finding

winners of Kemeny voting rule, etc. In this paper, we consider a further

generalization of the domain of single crossing profiles on trees to the domain

consisting of all preference profiles which can be extended to single crossing

preference profiles with respect to some tree by adding more preferences to it.

We call this domain the weakly single crossing domain on trees. We present a

polynomial time algorithm for recognizing weakly single crossing profiles on

trees. We then move on to develop a polynomial time algorithm with low query

complexity for eliciting weakly single crossing profiles on trees even when we

do not know any tree with respect to which the closure of the input profile is

single crossing and the preferences can be queried only sequentially; moreover,

the sequential order is also unknown. We complement the performance of our

preference elicitation algorithm by proving that our algorithm makes an optimal

number of queries up to constant factors when the number of preferences is

large compared to the number of candidates, even if the input profile is known

to be single crossing with respect to some given tree and the preferences can

be accessed randomly.

### Leveraging Video Descriptions to Learn Video Question Answering

Comments: 7 pages, 5 figures. Accepted to AAAI 2017

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

; Artificial Intelligence (cs.AI); Multimedia (cs.MM)

We propose a scalable approach to learn video-based question answering (QA):

answer a “free-form natural language question” about a video content. Our

approach automatically harvests a large number of videos and descriptions

freely available online. Then, a large number of candidate QA pairs are

automatically generated from descriptions rather than manually annotated. Next,

we use these candidate QA pairs to train a number of video-based QA methods

extended fromMN (Sukhbaatar et al. 2015), VQA (Antol et al. 2015), SA (Yao et

al. 2015), SS (Venugopalan et al. 2015). In order to handle non-perfect

candidate QA pairs, we propose a self-paced learning procedure to iteratively

identify them and mitigate their effects in training. Finally, we evaluate

performance on manually generated video-based QA pairs. The results show that

our self-paced learning procedure is effective, and the extended SS model

outperforms various baselines.

## Information Retrieval

### 1.5 billion words Arabic Corpus

Ibrahim Abu El-khair **Subjects** : Computation and Language (cs.CL) ; Digital Libraries (cs.DL); Information Retrieval (cs.IR)

This study is an attempt to build a contemporary linguistic corpus for Arabic

language. The corpus produced, is a text corpus includes more than five million

newspaper articles. It contains over a billion and a half words in total, out

of which, there is about three million unique words. The data were collected

from newspaper articles in ten major news sources from eight Arabic countries,

over a period of fourteen years. The corpus was encoded with two types of

encoding, namely: UTF-8, and Windows CP-1256. Also it was marked with two

mark-up languages, namely: SGML, and XML.

## Computation and Language

### Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

Melvin Johnson , Mike Schuster , Quoc V. Le , Maxim Krikun , Yonghui Wu , Zhifeng Chen , Nikhil Thorat , Fernanda Viégas , Martin Wattenberg , Greg Corrado , Macduff Hughes , Jeffrey Dean **Subjects** : Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)

We propose a simple, elegant solution to use a single Neural Machine

Translation (NMT) model to translate between multiple languages. Our solution

requires no change in the model architecture from our base system but instead

introduces an artificial token at the beginning of the input sentence to

specify the required target language. The rest of the model, which includes

encoder, decoder and attention, remains unchanged and is shared across all

languages. Using a shared wordpiece vocabulary, our approach enables

Multilingual NMT using a single model without any increase in parameters, which

is significantly simpler than previous proposals for Multilingual NMT. Our

method often improves the translation quality of all involved language pairs,

even while keeping the total number of model parameters constant. On the WMT’14

benchmarks, a single multilingual model achieves comparable performance for

English(

ightarrow)French and surpasses state-of-the-art results for

English(

ightarrow)German. Similarly, a single multilingual model surpasses

state-of-the-art results for French(

ightarrow)English and

German(

ightarrow)English on WMT’14 and WMT’15 benchmarks respectively. On

production corpora, multilingual models of up to twelve language pairs allow

for better translation of many individual pairs. In addition to improving the

translation quality of language pairs that the model was trained with, our

models can also learn to perform implicit bridging between language pairs never

seen explicitly during training, showing that transfer learning and zero-shot

translation is possible for neural translation. Finally, we show analyses that

hints at a universal interlingua representation in our models and show some

interesting examples when mixing languages.

Zero-resource Machine Translation by Multimodal Encoder-decoder Network with Multimedia Pivot

Hideki Nakayama , Noriki Nishida **Subjects** : Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

We propose an approach to build a neural machine translation system with no

supervised resources (i.e., no parallel corpora) using multimodal embedded

representation over texts and images. Based on the assumption that text

documents are often likely to be described with other multimedia information

(e.g., images) somewhat related to the content, we try to indirectly estimate

the relevance between two languages. Using multimedia as the “pivot”, we

project all modalities into one common hidden space where samples belonging to

similar semantic concepts should come close to each other, whatever the

observed space of each sample is. This modality-agnostic representation is the

key to bridging the gap between different modalities. Putting a decoder on top

of it, our network can flexibly draw the outputs from any input modality.

Notably, in the testing phase, we need only source language texts as the input

for translation. In experiments, we tested our method on two benchmarks to show

that it can achieve reasonable translation performance. We compared and

investigated several possible implementations and found that an end-to-end

model that simultaneously optimized both rank loss in multimodal encoders and

cross-entropy loss in decoders performed the best.

### Multi-view Recurrent Neural Acoustic Word Embeddings

Comments: In submission to ICLR 2017

**Subjects**

:

Computation and Language (cs.CL)

Recent work has begun exploring neural acoustic word

embeddings–fixed-dimensional vector representations of arbitrary-length speech

segments corresponding to words. Such embeddings are applicable to speech

retrieval and recognition tasks, where reasoning about whole words may make it

possible to avoid ambiguous sub-word representations. The main idea is to map

acoustic sequences to fixed-dimensional vectors such that examples of the same

word are mapped to similar vectors, while different-word examples are mapped to

very different vectors. In this work we take a multi-view approach to learning

acoustic word embeddings, in which we jointly learn to embed acoustic sequences

and their corresponding character sequences. We use deep bidirectional LSTM

embedding models and multi-view contrastive losses. We study the effect of

different loss variants, including fixed-margin and cost-sensitive losses. Our

acoustic word embeddings improve over previous approaches for the task of word

discrimination. We also present results on other tasks that are enabled by the

multi-view approach, including cross-view word discrimination and word

similarity.

### Ranking medical jargon in electronic health record notes by adapted distant supervision

Jinying Chen , Abhyuday N. Jagannatha , Samah J. Jarad , Hong Yu **Subjects** : Computation and Language (cs.CL)

Objective: Allowing patients to access their own electronic health record

(EHR) notes through online patient portals has the potential to improve

patient-centered care. However, medical jargon, which abounds in EHR notes, has

been shown to be a barrier for patient EHR comprehension. Existing knowledge

bases that link medical jargon to lay terms or definitions play an important

role in alleviating this problem but have low coverage of medical jargon in

EHRs. We developed a data-driven approach that mines EHRs to identify and rank

medical jargon based on its importance to patients, to support the building of

EHR-centric lay language resources.

Methods: We developed an innovative adapted distant supervision (ADS) model

based on support vector machines to rank medical jargon from EHRs. For distant

supervision, we utilized the open-access, collaborative consumer health

vocabulary, a large, publicly available resource that links lay terms to

medical jargon. We explored both knowledge-based features from the Unified

Medical Language System and distributed word representations learned from

unlabeled large corpora. We evaluated the ADS model using physician-identified

important medical terms.

Results: Our ADS model significantly surpassed two state-of-the-art automatic

term recognition methods, TF*IDF and C-Value, yielding 0.810 ROC-AUC versus

0.710 and 0.667, respectively. Our model identified 10K important medical

jargon terms after ranking over 100K candidate terms mined from over 7,500 EHR

narratives.

Conclusion: Our work is an important step towards enriching lexical resources

that link medical jargon to lay terms/definitions to support patient EHR

comprehension. The identified medical jargon terms and their rankings are

available upon request.

### Attending to Characters in Neural Sequence Labeling Models

Comments: Proceedings of COLING 2016

**Subjects**

:

Computation and Language (cs.CL)

; Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Sequence labeling architectures use word embeddings for capturing similarity,

but suffer when handling previously unseen or rare words. We investigate

character-level extensions to such models and propose a novel architecture for

combining alternative word representations. By using an attention mechanism,

the model is able to dynamically decide how much information to use from a

word- or character-level component. We evaluated different architectures on a

range of sequence labeling datasets, and character-level extensions were found

to improve performance on every benchmark. In addition, the proposed

attention-based architecture delivered the best results even with a smaller

number of trainable parameters.

### Character-level Convolutional Network for Text Classification Applied to Chinese Corpus

Weijie Huang **Subjects** : Computation and Language (cs.CL)

This article provides an interesting exploration of character-level

convolutional neural network solving Chinese corpus text classification

problem. We constructed a large-scale Chinese language dataset, and the result

shows that character-level convolutional neural network works better on Chinese

corpus than its corresponding pinyin format dataset. This is the first time

that character-level convolutional neural network applied to text

classification problem.

Comments: This paper will be presented at ExPROM workshop at COLING 2016

**Subjects**

:

Computation and Language (cs.CL)

Topic Models have been reported to be beneficial for aspect-based sentiment

analysis. This paper reports a simple topic model for sarcasm detection, a

first, to the best of our knowledge. Designed on the basis of the intuition

that sarcastic tweets are likely to have a mixture of words of both sentiments

as against tweets with literal sentiment (either positive or negative), our

hierarchical topic model discovers sarcasm-prevalent topics and topic-level

sentiment. Using a dataset of tweets labeled using hashtags, the model

estimates topic-level, and sentiment-level distributions. Our evaluation shows

that topics such as `work’, `gun laws’, `weather’ are sarcasm-prevalent topics.

Our model is also able to discover the mixture of sentiment-bearing words that

exist in a text of a given sentiment-related label. Finally, we apply our model

to predict sarcasm in tweets. We outperform two prior work based on statistical

classifiers with specific features, by around 25/%.

### Classify or Select: Neural Architectures for Extractive Document Summarization

Ramesh Nallapati , Bowen Zhou , Mingbo Ma **Subjects** : Computation and Language (cs.CL)

We present two novel and contrasting Recurrent Neural Network (RNN) based

architectures for extractive summarization of documents. The Classifier based

architecture sequentially accepts or rejects each sentence in the original

document order for its membership in the final summary. The Selector

architecture, on the other hand, is free to pick one sentence at a time in any

arbitrary order to piece together the summary.

Our models under both architectures jointly capture the notions of salience

and redundancy of sentences. In addition, these models have the advantage of

being very interpretable, since they allow visualization of their predictions

broken up by abstract features such as information content, salience and

redundancy.

We show that our models reach or outperform state-of-the-art supervised

models on two different corpora. We also recommend the conditions under which

one architecture is superior to the other based on experimental evidence.

F-Score Driven Max Margin Neural Network for Named Entity Recognition in Chinese Social Media

Hangfeng He , Xu Sun **Subjects** : Computation and Language (cs.CL)

We focus on named entity recognition (NER) for Chinese social media. With

massive unlabeled text and quite limited labelled corpus, we propose a

semi-supervised learning model based on B-LSTM neural network. To take

advantage of traditional methods in NER such as CRF, we combine transition

probability with deep learning in our model. To bridge the gap between label

accuracy and F-score of NER, we construct a model which can be directly trained

on F-score. When considering the instability of F-score driven method and

meaningful information provided by label accuracy, we propose an integrated

method to train on both F-score and label accuracy. Our integrated model yields

7.44/% improvement over previous state-of-the-art result.

### A New Recurrent Neural CRF for Learning Non-linear Edge Features

Shuming Ma , Xu Sun **Subjects** : Computation and Language (cs.CL)

Conditional Random Field (CRF) and recurrent neural models have achieved

success in structured prediction. More recently, there is a marriage of CRF and

recurrent neural models, so that we can gain from both non-linear dense

features and globally normalized CRF objective. These recurrent neural CRF

models mainly focus on encode node features in CRF undirected graphs. However,

edge features prove important to CRF in structured prediction. In this work, we

introduce a new recurrent neural CRF model, which learns non-linear edge

features, and thus makes non-linear features encoded completely. We compare our

model with different neural models in well-known structured prediction tasks.

Experiments show that our model outperforms state-of-the-art methods in NP

chunking, shallow parsing, Chinese word segmentation and POS tagging.

Comments: Published at AAAI 2017, The Thirty-First AAAI Conference on Artificial Intelligence (AAAI-2017)

**Subjects**

:

Computation and Language (cs.CL)

We present SummaRuNNer, a Recurrent Neural Network (RNN) based sequence model

for extractive summarization of documents and show that it achieves performance

better than or comparable to state-of-the-art. Our model has the additional

advantage of being very interpretable, since it allows visualization of its

predictions broken up by abstract features such as information content,

salience and novelty. Another novel contribution of our work is abstractive

training of our extractive model that can train on human generated reference

summaries alone, eliminating the need for sentence-level extractive labels.

### Joint Representation Learning of Text and Knowledge for Knowledge Graph Completion

Xu Han , Zhiyuan Liu , Maosong Sun **Subjects** : Computation and Language (cs.CL)

Joint representation learning of text and knowledge within a unified semantic

space enables us to perform knowledge graph completion more accurately. In this

work, we propose a novel framework to embed words, entities and relations into

the same continuous vector space. In this model, both entity and relation

embeddings are learned by taking knowledge graph and plain text into

consideration. In experiments, we evaluate the joint learning model on three

tasks including entity prediction, relation prediction and relation

classification from text. The experiment results show that our model can

significantly and consistently improve the performance on the three tasks as

compared with other baselines.

### Cross-lingual Dataless Classification for Languages with Small Wikipedia Presence

Yangqiu Song , Stephen Mayhew , Dan Roth **Subjects** : Computation and Language (cs.CL)

This paper presents an approach to classify documents in any language into an

English topical label space, without any text categorization training data. The

approach, Cross-Lingual Dataless Document Classification (CLDDC) relies on

mapping the English labels or short category description into a Wikipedia-based

semantic representation, and on the use of the target language Wikipedia.

Consequently, performance could suffer when Wikipedia in the target language is

small. In this paper, we focus on languages with small Wikipedias,

(Small-Wikipedia languages, SWLs). We use a word-level dictionary to convert

documents in a SWL to a large-Wikipedia language (LWLs), and then perform CLDDC

based on the LWL’s Wikipedia. This approach can be applied to thousands of

languages, which can be contrasted with machine translation, which is a

supervision heavy approach and can be done for about 100 languages. We also

develop a ranking algorithm that makes use of language similarity metrics to

automatically select a good LWL, and show that this significantly improves

classification of SWLs’ documents, performing comparably to the best bridge

possible.

### Semi-automatic Simultaneous Interpreting Quality Evaluation

Journal-ref: International Journal on Natural Language Computing (IJNLC) Vol.

5, No.5, October 2016

**Subjects**

:

Computation and Language (cs.CL)

Increasing interpreting needs a more objective and automatic measurement. We

hold a basic idea that ‘translating means translating meaning’ in that we can

assessment interpretation quality by comparing the meaning of the interpreting

output with the source input. That is, a translation unit of a ‘chunk’ named

Frame which comes from frame semantics and its components named Frame Elements

(FEs) which comes from Frame Net are proposed to explore their matching rate

between target and source texts. A case study in this paper verifies the

usability of semi-automatic graded semantic-scoring measurement for human

simultaneous interpreting and shows how to use frame and FE matches to score.

Experiments results show that the semantic-scoring metrics have a significantly

correlation coefficient with human judgment.

### 1.5 billion words Arabic Corpus

Ibrahim Abu El-khair **Subjects** : Computation and Language (cs.CL) ; Digital Libraries (cs.DL); Information Retrieval (cs.IR)

This study is an attempt to build a contemporary linguistic corpus for Arabic

language. The corpus produced, is a text corpus includes more than five million

newspaper articles. It contains over a billion and a half words in total, out

of which, there is about three million unique words. The data were collected

from newspaper articles in ten major news sources from eight Arabic countries,

over a period of fourteen years. The corpus was encoded with two types of

encoding, namely: UTF-8, and Windows CP-1256. Also it was marked with two

mark-up languages, namely: SGML, and XML.

### Multi-Language Identification Using Convolutional Recurrent Neural Network

Comments: 10 pages, 6 figures

**Subjects**

:

Computation and Language (cs.CL)

Language Identification, being an important aspect of Automatic Speaker

Recognition has had many changes and new approaches to ameliorate performance

over the last decade. We compare the performance of using audio spectrum in the

log scale and using Polyphonic sound sequences from raw audio samples to train

the neural network and to classify speech as either English or Spanish. To

achieve this, we use the novel approach of using a Convolutional Recurrent

Neural Network using Long Short Term Memory (LSTM) or a Gated Recurrent Unit

(GRU) for forward propagation of the neural network. Our hypothesis is that the

performance of using polyphonic sound sequence as features and both LSTM and

GRU as the gating mechanisms for the neural network outperform the traditional

MFCC features using a unidirectional Deep Neural Network.

### Linguistically Regularized LSTMs for Sentiment Classification

Qiao Qian , Minlie Huang , Xiaoyan Zhu **Subjects** : Computation and Language (cs.CL)

Sentiment understanding has been a long-term goal of AI in the past decades.

This paper deals with sentence-level sentiment classification. Though a variety

of neural network models have been proposed very recently, however, previous

models either depend on expensive phrase-level annotation, whose performance

drops substantially when trained with only sentence-level annotation; or do not

fully employ linguistic resources (e.g., sentiment lexicons, negation words,

intensity words), thus not being able to produce linguistically coherent

representations. In this paper, we propose simple models trained with

sentence-level annotation, but also attempt to generating linguistically

coherent representations by employing regularizers that model the linguistic

role of sentiment lexicons, negation words, and intensity words. Results show

that our models are effective to capture the sentiment shifting effect of

sentiment, negation, and intensity words, while still obtain competitive

results without sacrificing the models’ simplicity.

### Training IBM Watson using Automatically Generated Question-Answer Pairs

Jangho Lee , Gyuwan Kim , Jaeyoon Yoo , Changwoo Jung , Minseok Kim , Sungroh Yoon **Subjects** : Computation and Language (cs.CL)

IBM Watson is a cognitive computing system capable of question answering in

natural languages. It is believed that IBM Watson can understand large corpora

and answer relevant questions more effectively than any other

question-answering system currently available. To unleash the full power of

Watson, however, we need to train its instance with a large number of

well-prepared question-answer pairs. Obviously, manually generating such pairs

in a large quantity is prohibitively time consuming and significantly limits

the efficiency of Watson’s training. Recently, a large-scale dataset of over 30

million question-answer pairs was reported. Under the assumption that using

such an automatically generated dataset could relieve the burden of manual

question-answer generation, we tried to use this dataset to train an instance

of Watson and checked the training efficiency and accuracy. According to our

experiments, using this auto-generated dataset was effective for training

Watson, complementing manually crafted question-answer pairs. To the best of

the authors’ knowledge, this work is the first attempt to use a large-scale

dataset of automatically generated question-answer pairs for training IBM

Watson. We anticipate that the insights and lessons obtained from our

experiments will be useful for researchers who want to expedite Watson training

leveraged by automatically generated question-answer pairs.

### Multi-lingual Knowledge Graph Embeddings for Cross-lingual Knowledge Alignment

Muhao Chen , Yingtao Tian , Mohan Yang , Carlo Zaniolo **Subjects** : Artificial Intelligence (cs.AI) ; Computation and Language (cs.CL)

Many recent works have demonstrated the benefits of knowledge graph

embeddings in completing monolingual knowledge graphs. Inasmuch as related

knowledge bases are built in several different languages, achieving

cross-lingual knowledge alignment will help people in constructing a coherent

knowledge base, and assist machines in dealing with different expressions of

entity relationships across diverse human languages. Unfortunately, achieving

this highly desirable crosslingual alignment by human labor is very costly and

errorprone. Thus, we propose MTransE, a translation-based model for

multilingual knowledge graph embeddings, to provide a simple and automated

solution. By encoding entities and relations of each language in a separated

embedding space, MTransE provides transitions for each embedding vector to its

cross-lingual counterparts in other spaces, while preserving the

functionalities of monolingual embeddings. We deploy three different techniques

to represent cross-lingual transitions, namely axis calibration, translation

vectors, and linear transformations, and derive five variants for MTransE using

different loss functions. Our models can be trained on partially aligned

graphs, where just a small portion of triples are aligned with their

cross-lingual counterparts. The experiments on cross-lingual entity matching

and triple-wise alignment verification show promising results, with some

variants consistently outperforming others on different tasks. We also explore

how MTransE preserves the key properties of its monolingual counterpart TransE.

## Distributed, Parallel, and Cluster Computing

### Byzantine Processors and Cuckoo Birds: Confining Maliciousness to the Outset

Comments: arXiv admin note: text overlap with arXiv:1607.01210

**Subjects**

:

Distributed, Parallel, and Cluster Computing (cs.DC)

Are there Byzantine Animals? A Fooling Behavior is exhibited by the Cuckoo

bird. It sneakily replaces some of the eggs of other species with its own. Lest

the Cuckoo extinct itself by destroying its host, it self-limits its power: It

does not replace too large a fraction of the eggs. Here, we show that any

Byzantine Behavior that does not destroy the system it attacks, i.e. allows the

system to solve an easy task like epsilon-agreement, then its maliciousness can

be confined to be the exact replica of the Cuckoo bird behavior: Undetectably

replace an input of a processor and let the processor behave correctly

thereafter with respect to the new input. In doing so we reduce the study of

Byzantine behavior to fail-stop (benign) behavior with the Cuckoo caveat of a

fraction of the inputs replaced. We establish a complete correspondence between

the Byzantine and the Benign, modulo different thresholds, and replaced inputs.

This work is yet another step in a line of work unifying seemingly distinct

distributed system models, dispelling the Myth that Distributed Computing is a

plethora of distinct isolated models, each requiring its specialized tools and

ideas in order to determine solvability of tasks. Thus, hereafter, Byzantine

Computability questions can be reduced to questions in the benign failure

setting. We also show that the known results about correlated faults in the

asynchronous benign setting can be imported verbatim to the asynchronous

Byzantine setting. Finally, as in the benign case in which we have the property

that a processor can output once its faulty behavior stops for long enough, we

show this can be done in a similar manner in the Byzantine case. This

necessitated the generalization of Reliable Broadcast to what we term

Recoverable Reliable Broadcast.

### Efficient Communications in Training Large Scale Neural Networks

Linnan Wang , Wei Wu , George Bosilca , Richard Vuduc , Zenglin Xu **Subjects** : Distributed, Parallel, and Cluster Computing (cs.DC)

We consider the problem of how to reduce the cost of communication that is

required for the parallel training of a neural network. The state-of-the-art

method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD),

requires many collective communication operations, like broadcasts of

parameters or reductions for sub-gradient aggregations, which for large

messages quickly dominates overall execution time and limits parallel

scalability. To address this problem, we develop a new technique for collective

operations, referred to as Linear Pipelining (LP). It is tuned to the message

sizes that arise in BSP-SGD, and works effectively on multi-GPU systems.

Theoretically, the cost of LP is invariant to (P), where (P) is the number of

GPUs, while the cost of more conventional Minimum Spanning Tree (MST) scales

like (O(log P)). LP also demonstrate up to 2x faster bandwidth than

Bidirectional Exchange (BE) techniques that are widely adopted by current MPI

implementations. We apply these collectives to BSP-SGD, showing that the

proposed implementations reduce communication bottlenecks in practice while

preserving the attractive convergence properties of BSP-SGD.

### A parallel workload has extreme varibility

Comments: 7 pages, 2 figures, 1 code listing

**Subjects**

:

Distributed, Parallel, and Cluster Computing (cs.DC)

; Data Analysis, Statistics and Probability (physics.data-an); Computation (stat.CO)

In both high-performance computing (HPC) environments and the public cloud,

the duration of time to retrieve or save your results is simultaneously

unpredictable and important to your over all resource budget. It is generally

accepted (“Google: Taming the Long Latency Tail – When More Machines Equals

Worse Results”, Todd Hoff, highscalability.com 2012), but without a robust

explanation, that identical parallel tasks do take different durations to

complete — a phenomena known as variability. This paper advances understanding

of this topic. We carefully choose a model from which system-level complexity

emerges that can be studied directly. We find that a generalized extreme value

(GEV) model for variability naturally emerges. Using the public cloud, we find

real-world observations have excellent agreement with our model. Since the GEV

distribution is a limit distribution this suggests a universal property of

parallel systems gated by the slowest communication element of some sort.

Hence, this model is applicable to a variety of processing and IO tasks in

parallel environments. These findings have important implications, ranging from

characterizing ideal performance for parallel codes to detecting degraded

behaviour at extreme scales.

### Timestamps for Partial Replication

Zhuolun Xiang , Nitin H. Vaidya **Subjects** : Distributed, Parallel, and Cluster Computing (cs.DC)

Maintaining causal consistency in distributed shared memory systems using

vector timestamps has received a lot of attention from both theoretical and

practical prospective. However, most of the previous literature focuses on full

replication where each data is stored in all replicas, which may not be

scalable due to the increasing amount of data. In this report, we investigate

how to achieve causal consistency in partial replicated systems, where each

replica may store different set of data. We propose an algorithm that tracks

causal dependencies via vector timestamp in client-server model for partial

replication. The cost of our algorithm in terms of timestamps size varies as a

function of the manner in which the replicas share data, and the set of

replicas accessed by each client. We also establish a connection between our

algorithm with the previous work on full replication.

### Maintaining Acyclicity of Concurrent Graphs

Comments: 23 pages 7 figures

**Subjects**

:

Distributed, Parallel, and Cluster Computing (cs.DC)

In this paper, we consider the problem of preserving acyclicity in a directed

graph (for shared memory architecture) that is concurrently being updated by

threads adding/deleting vertices and edges. To the best of our knowledge, no

previous paper has presented a concurrent graph data structure. We implement

the concurrent directed graph data-structure as a concurrent adjacency list

representation. We extend the lazy list implementation of concurrent linked

lists for maintaining concurrent adjacency lists. There exists a number of

graph applications which require the acyclic invariant in a directed graph. One

such example is Serialization Graph Testing Algorithm used in databases and

transactional memory. We present two concurrent algorithms for maintaining

acyclicity in a concurrent graph: (i) Based on obstruction-free snapshots (ii)

Using wait-free reachability. We compare the performance of these algorithms

against the coarse-grained locking strategy, commonly used technique for

allowing concurrent updates. We present the speedup obtained by these

algorithms over sequential execution. As a future direction, we plan to extend

this data structure for other progress conditions.

### On Location Hiding in Distributed Systems

Comments: Submitted to 10th International Conference on Algorithms and Complexity CIAC 2017

**Subjects**

:

Multiagent Systems (cs.MA)

; Distributed, Parallel, and Cluster Computing (cs.DC)

We consider the following problem – a group of mobile agents perform some

task on a terrain modeled as a graph. In a given moment of time an adversary

gets an access to the graph and positions of the agents. Shortly before

adversary’s observation the mobile agents have a chance to relocate themselves

in order to hide their initial configuration. We assume that the initial

configuration may possibly reveal to the adversary some information about the

task they performed. Clearly agents have to change their location in possibly

short time using minimal energy. In our paper we introduce a definition of a

emph{well hiding} algorithm in which the starting and final configurations of

the agents have small mutual information. Then we discuss the influence of

various features of the model on the running time of the optimal well-hiding

algorithm. We show that if the topology of the graph is known to the agents,

then the number of steps proportional to the diameter of the graph is

sufficient and necessary. In the unknown topology scenario we only consider a

single agent case. We first show that the task is impossible in the

deterministic case if the agent has no memory. Then we present a polynomial

randomized algorithm. Finally in the model with memory we show that the number

of steps proportional to the number of edges of the graph is sufficient and

necessary. In some sense we investigate how complex is the problem of “losing”

information about location (both physical and logical) for different settings.

## Learning

### How to scale distributed deep learning?

Comments: Extended version of paper accepted at ML Sys 2016 (at NIPS 2016)

**Subjects**

:

Learning (cs.LG)

Training time on large datasets for deep neural networks is the principal

workflow bottleneck in a number of important applications of deep learning,

such as object classification and detection in automatic driver assistance

systems (ADAS). To minimize training time, the training of a deep neural

network must be scaled beyond a single machine to as many machines as possible

by distributing the optimization method used for training. While a number of

approaches have been proposed for distributed stochastic gradient descent

(SGD), at the current time synchronous approaches to distributed SGD appear to

be showing the greatest performance at large scale. Synchronous scaling of SGD

suffers from the need to synchronize all processors on each gradient step and

is not resilient in the face of failing or lagging processors. In asynchronous

approaches using parameter servers, training is slowed by contention to the

parameter server. In this paper we compare the convergence of synchronous and

asynchronous SGD for training a modern ResNet network architecture on the

ImageNet classification problem. We also propose an asynchronous method,

gossiping SGD, that aims to retain the positive features of both systems by

replacing the all-reduce collective operation of synchronous training with a

gossip aggregation algorithm. We find, perhaps counterintuitively, that

asynchronous SGD, including both elastic averaging and gossiping, converges

faster at fewer nodes (up to about 32 nodes), whereas synchronous SGD scales

better to more nodes (up to about 100 nodes).

### Earliness-Aware Deep Convolutional Networks for Early Time Series Classification

Wenlin Wang , Changyou Chen , Wenqi Wang , Piyush Rai , Lawrence Carin **Subjects** : Learning (cs.LG)

We present Earliness-Aware Deep Convolutional Networks (EA-ConvNets), an

end-to-end deep learning framework, for early classification of time series

data. Unlike most existing methods for early classification of time series

data, that are designed to solve this problem under the assumption of the

availability of a good set of pre-defined (often hand-crafted) features, our

framework can jointly perform feature learning (by learning a deep hierarchy of

emph{shapelets} capturing the salient characteristics in each time series),

along with a dynamic truncation model to help our deep feature learning

architecture focus on the early parts of each time series. Consequently, our

framework is able to make highly reliable early predictions, outperforming

various state-of-the-art methods for early time series classification, while

also being competitive when compared to the state-of-the-art time series

classification algorithms that work with emph{fully observed} time series

data. To the best of our knowledge, the proposed framework is the first to

perform data-driven (deep) feature learning in the context of early

classification of time series data. We perform a comprehensive set of

experiments, on several benchmark data sets, which demonstrate that our method

yields significantly better predictions than various state-of-the-art methods

designed for early time series classification. In addition to obtaining high

accuracies, our experiments also show that the learned deep shapelets based

features are also highly interpretable and can help gain better understanding

of the underlying characteristics of time series data.

### Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes

Mengye Ren , Renjie Liao , Raquel Urtasun , Fabian H. Sinz , Richard S. Zemel **Subjects** : Learning (cs.LG)

Normalization techniques have only recently begun to be exploited in

supervised learning tasks. Batch normalization exploits mini-batch statistics

to normalize the activations. This was shown to speed up training and result in

better models. However its success has been very limited when dealing with

recurrent neural networks. On the other hand, layer normalization normalizes

the activations across all activities within a layer. This was shown to work

well in the recurrent setting. In this paper we propose a unified view of

normalization techniques, as forms of divisive normalization, which includes

layer and batch normalization as special cases. Our second contribution is the

finding that a small modification to these normalization schemes, in

conjunction with a sparse regularizer on the activations, leads to significant

benefits over standard normalization techniques. We demonstrate the

effectiveness of our unified divisive normalization framework in the context of

convolutional neural nets and recurrent neural networks, showing improvements

over baselines in image classification, language modeling as well as

super-resolution.

### On the Quantitative Analysis of Decoder-Based Generative Models

Yuhuai Wu , Yuri Burda , Ruslan Salakhutdinov , Roger Grosse **Subjects** : Learning (cs.LG)

The past several years have seen remarkable progress in generative models

which produce convincing samples of images and other modalities. A shared

component of many powerful generative models is a decoder network, a parametric

deep neural net that defines a generative distribution. Examples include

variational autoencoders, generative adversarial networks, and generative

moment matching networks. Unfortunately, it can be difficult to quantify the

performance of these models because of the intractability of log-likelihood

estimation, and inspecting samples can be misleading. We propose to use

Annealed Importance Sampling for evaluating log-likelihoods for decoder-based

models and validate its accuracy using bidirectional Monte Carlo. Using this

technique, we analyze the performance of decoder-based models, the

effectiveness of existing log-likelihood estimators, the degree of overfitting,

and the degree to which these models miss important modes of the data

distribution.

### Identity Matters in Deep Learning

Moritz Hardt , Tengyu Ma **Subjects** : Learning (cs.LG) ; Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

An emerging design principle in deep learning is that each layer of a deep

artificial neural network should be able to easily express the identity

transformation. This idea not only motivated various normalization techniques,

such as emph{batch normalization}, but was also key to the immense success of

emph{residual networks}.

In this work, we put the principle of emph{identity parameterization} on a

more solid theoretical footing alongside further empirical progress. We first

give a strikingly simple proof that arbitrarily deep linear residual networks

have no spurious local optima. The same result for linear feed-forward networks

in their standard parameterization is substantially more delicate. Second, we

show that residual networks with ReLu activations have universal finite-sample

expressivity in the sense that the network can represent any function of its

sample provided that the model has more parameters than the sample size.

Directly inspired by our theory, we experiment with a radically simple

residual architecture consisting of only residual convolutional layers and ReLu

activations, but no batch normalization, dropout, or max pool. Our model

improves significantly on previous all-convolutional networks on the CIFAR10,

CIFAR100, and ImageNet classification benchmarks.

### Learning Sparse, Distributed Representations using the Hebbian Principle

Aseem Wadhwa , Upamanyu Madhow **Subjects** : Learning (cs.LG)

The “fire together, wire together” Hebbian model is a central principle for

learning in neuroscience, but surprisingly, it has found limited applicability

in modern machine learning. In this paper, we take a first step towards

bridging this gap, by developing flavors of competitive Hebbian learning which

produce sparse, distributed neural codes using online adaptation with minimal

tuning. We propose an unsupervised algorithm, termed Adaptive Hebbian Learning

(AHL). We illustrate the distributed nature of the learned representations via

output entropy computations for synthetic data, and demonstrate superior

performance, compared to standard alternatives such as autoencoders, in

training a deep convolutional net on standard image datasets.

### (CAD)(^2)RL: Real Single-Image Flight without a Single Real Image

Comments: 11 pages

**Subjects**

:

Learning (cs.LG)

; Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

We propose (CAD)(^2)RL, a flight controller for Collision Avoidance via Deep

Reinforcement Learning that can be used to perform collision-free flight in the

real world although it is trained entirely in a 3D CAD model simulator. Our

method uses only single RGB images from a monocular camera mounted on the robot

as the input and is specialized for indoor hallway following and obstacle

avoidance. In contrast to most indoor navigation techniques that aim to

directly reconstruct the 3D geometry of the environment, our approach directly

predicts the probability of collision given the current monocular image and a

candidate action. To obtain accurate predictions, we develop a deep

reinforcement learning algorithm for learning indoor navigation, which uses the

actual performance of the current policy to construct accurate supervision. The

collision prediction model is represented by a deep convolutional neural

network that directly processes raw image inputs. Our collision avoidance

system is entirely trained in simulation and thus addresses the high sample

complexity of deep reinforcement learning and avoids the dangers of

trial-and-error learning in the real world. By highly randomizing the rendering

settings for our simulated training set, we show that we can train a collision

predictor that generalizes to new environments with substantially different

appearance from the training scenarios. Finally, we evaluate our method in the

real world by controlling a real quadrotor flying through real hallways. We

demonstrate that our method can perform real-world collision avoidance and

hallway following after being trained exclusively on synthetic images, without

ever having seen a single real image at the training time. For supplementary

video see: this https URL

### Realistic risk-mitigating recommendations via inverse classification

Michael T. Lash , W. Nick Street **Subjects** : Learning (cs.LG) ; Machine Learning (stat.ML)

Inverse classification, the process of making meaningful perturbations to a

test point such that it is more likely to have a desired classification, has

previously been addressed using data from a single static point in time. Such

an approach yields inflated probability estimates, stemming from an implicitly

made assumption that recommendations are implemented instantaneously. We

propose using longitudinal data to alleviate such issues in two ways. First, we

use past outcome probabilities as features in the present. Use of such past

probabilities ties historical behavior to the present, allowing for more

information to be taken into account when making initial probability estimates

and subsequently performing inverse classification. Secondly, following inverse

classification application, optimized instances’ unchangeable features

(e.g.,~age) are updated using values from the next longitudinal time period.

Optimized test instance probabilities are then reassessed. Updating the

unchangeable features in this manner reflects the notion that improvements in

outcome likelihood, which result from following the inverse classification

recommendations, do not materialize instantaneously. As our experiments

demonstrate, more realistic estimates of probability can be obtained by

factoring in such considerations.

### Batched Gaussian Process Bandit Optimization via Determinantal Point Processes

Comments: To appear at NIPS 2016

**Subjects**

:

Learning (cs.LG)

Gaussian Process bandit optimization has emerged as a powerful tool for

optimizing noisy black box functions. One example in machine learning is

hyper-parameter optimization where each evaluation of the target function

requires training a model which may involve days or even weeks of computation.

Most methods for this so-called “Bayesian optimization” only allow sequential

exploration of the parameter space. However, it is often desirable to propose

batches or sets of parameter values to explore simultaneously, especially when

there are large parallel processing facilities at our disposal. Batch methods

require modeling the interaction between the different evaluations in the

batch, which can be expensive in complex scenarios. In this paper, we propose a

new approach for parallelizing Bayesian optimization by modeling the diversity

of a batch via Determinantal point processes (DPPs) whose kernels are learned

automatically. This allows us to generalize a previous result as well as prove

better regret bounds based on DPP sampling. Our experiments on a variety of

synthetic and real-world robotics and hyper-parameter optimization tasks

indicate that our DPP-based methods, especially those based on DPP sampling,

outperform state-of-the-art methods.

### Prognostics of Surgical Site Infections using Dynamic Health Data

Comments: 23 pages, 8 figures

**Subjects**

:

Learning (cs.LG)

Surgical Site Infection (SSI) is a national priority in healthcare research.

Much research attention has been attracted to develop better SSI risk

prediction models. However, most of the existing SSI risk prediction models are

built on static risk factors such as comorbidities and operative factors. In

this paper, we investigate the use of the dynamic wound data for SSI risk

prediction. There have been emerging mobile health (mHealth) tools that can

closely monitor the patients and generate continuous measurements of many

wound-related variables and other evolving clinical variables. Since existing

prediction models of SSI have quite limited capacity to utilize the evolving

clinical data, we develop the corresponding solution to equip these mHealth

tools with decision-making capabilities for SSI prediction with a seamless

assembly of several machine learning models to tackle the analytic challenges

arising from the spatial-temporal data. The basic idea is to exploit the

low-rank property of the spatial-temporal data via the bilinear formulation,

and further enhance it with automatic missing data imputation by the matrix

completion technique. We derive efficient optimization algorithms to implement

these models and demonstrate the superior performances of our new predictive

model on a real-world dataset of SSI, compared to a range of state-of-the-art

methods.

### Dual Teaching: A Practical Semi-supervised Wrapper Method

Comments: 7 pages, 4 figures, 1 table

**Subjects**

:

Learning (cs.LG)

; Machine Learning (stat.ML)

Semi-supervised wrapper methods are concerned with building effective

supervised classifiers from partially labeled data. Though previous works have

succeeded in some fields, it is still difficult to apply semi-supervised

wrapper methods to practice because the assumptions those methods rely on tend

to be unrealistic in practice. For practical use, this paper proposes a novel

semi-supervised wrapper method, Dual Teaching, whose assumptions are easy to

set up. Dual Teaching adopts two external classifiers to estimate the false

positives and false negatives of the base learner. Only if the recall of every

external classifier is greater than zero and the sum of the precision is

greater than one, Dual Teaching will train a base learner from partially

labeled data as effectively as the fully-labeled-data-trained classifier. The

effectiveness of Dual Teaching is proved in both theory and practice.

### Anomaly Detection in Bitcoin Network Using Unsupervised Learning Methods

Thai Pham , Steven Lee **Subjects** : Learning (cs.LG) ; Cryptography and Security (cs.CR)

The problem of anomaly detection has been studied for a long time. In short,

anomalies are abnormal or unlikely things. In financial networks, thieves and

illegal activities are often anomalous in nature. Members of a network want to

detect anomalies as soon as possible to prevent them from harming the network’s

community and integrity. Many Machine Learning techniques have been proposed to

deal with this problem; some results appear to be quite promising but there is

no obvious superior method. In this paper, we consider anomaly detection

particular to the Bitcoin transaction network. Our goal is to detect which

users and transactions are the most suspicious; in this case, anomalous

behavior is a proxy for suspicious behavior. To this end, we use three

unsupervised learning methods including k-means clustering, Mahalanobis

distance, and Unsupervised Support Vector Machine (SVM) on two graphs generated

by the Bitcoin transaction network: one graph has users as nodes, and the other

has transactions as nodes.

### Personalized Donor-Recipient Matching for Organ Transplantation

Jinsung Yoon , Ahmed M. Alaa , Martin Cadeiras , Mihaela van der Schaar **Subjects** : Learning (cs.LG)

Organ transplants can improve the life expectancy and quality of life for the

recipient but carries the risk of serious post-operative complications, such as

septic shock and organ rejection. The probability of a successful transplant

depends in a very subtle fashion on compatibility between the donor and the

recipient but current medical practice is short of domain knowledge regarding

the complex nature of recipient-donor compatibility. Hence a data-driven

approach for learning compatibility has the potential for significant

improvements in match quality. This paper proposes a novel system

(ConfidentMatch) that is trained using data from electronic health records.

ConfidentMatch predicts the success of an organ transplant (in terms of the 3

year survival rates) on the basis of clinical and demographic traits of the

donor and recipient. ConfidentMatch captures the heterogeneity of the donor and

recipient traits by optimally dividing the feature space into clusters and

constructing different optimal predictive models to each cluster. The system

controls the complexity of the learned predictive model in a way that allows

for assuring more granular and confident predictions for a larger number of

potential recipient-donor pairs, thereby ensuring that predictions are

“personalized” and tailored to individual characteristics to the finest

possible granularity. Experiments conducted on the UNOS heart transplant

dataset show the superiority of the prognostic value of ConfidentMatch to other

competing benchmarks; ConfidentMatch can provide predictions of success with

95% confidence for 5,489 patients of a total population of 9,620 patients,

which corresponds to 410 more patients than the most competitive benchmark

algorithm (DeepBoost).

### Low Latency Anomaly Detection and Bayesian Network Prediction of Anomaly Likelihood

Derek Farren , Thai Pham , Marco Alban-Hidalgo **Subjects** : Learning (cs.LG)

We develop a supervised machine learning model that detects anomalies in

systems in real time. Our model processes unbounded streams of data into time

series which then form the basis of a low-latency anomaly detection model.

Moreover, we extend our preliminary goal of just anomaly detection to

simultaneous anomaly prediction. We approach this very challenging problem by

developing a Bayesian Network framework that captures the information about the

parameters of the lagged regressors calibrated in the first part of our

approach and use this structure to learn local conditional probability

distributions.

### Unsupervised Learning For Effective User Engagement on Social Media

Thai Pham , Camelia Simoiu **Subjects** : Learning (cs.LG)

In this paper, we investigate the effectiveness of unsupervised feature

learning techniques in predicting user engagement on social media.

Specifically, we compare two methods to predict the number of feedbacks (i.e.,

comments) that a blog post is likely to receive. We compare Principal Component

Analysis (PCA) and sparse Autoencoder to a baseline method where the data are

only centered and scaled, on each of two models: Linear Regression and

Regression Tree. We find that unsupervised learning techniques significantly

improve the prediction accuracy on both models. For the Linear Regression

model, sparse Autoencoder achieves the best result, with an improvement in the

root mean squared error (RMSE) on the test set of 42% over the baseline method.

For the Regression Tree model, PCA achieves the best result, with an

improvement in RMSE of 15% over the baseline.

### Learning the best algorithm for max-cut, clustering, and other partitioning problems

Maria-Florina Balcan , Vaishnavh Nagarajan , Ellen Vitercik , Colin White **Subjects** : Data Structures and Algorithms (cs.DS) ; Artificial Intelligence (cs.AI); Learning (cs.LG)

Many data analysis problems are NP-hard, a reality that has motivated

researchers to develop a wealth of approximation algorithms and heuristics over

the past few decades. Max-cut, clustering, and many other widely applicable

partitioning problems fall into this camp. We focus on two common classes of

algorithms that are used for clustering and partitioning problems, including

SDP rounding algorithms and agglomerative clustering algorithms with dynamic

programming. The best algorithm to use typically depends on the specific

application or distribution over instances. A worst-case analysis is often used

to compare algorithms, but worst-case instances may not be present in the

particular problem domain, and thus may be misleading when determining which

algorithm to apply. Therefore, it is necessary to develop optimization methods

which return the algorithms and parameters best suited for typical inputs from

the application at hand. We address this problem for integer quadratic

programming and clustering within a learning-theoretic framework where the

learning algorithm is given samples from an application-specific distribution

over problem instances and learns an algorithm which performs well over the

distribution. We provide sample complexity guarantees and computationally

efficient algorithms which optimize an algorithm family’s parameters to best

suit typical inputs from a particular application. We analyze these algorithms

using the learning-theoretic notion of pseudo-dimension. Along with upper

bounds, we provide the first pseudo-dimension lower bounds in this line of

work, which require an involved analysis of each algorithm family’s performance

on carefully constructed problem instances. Our bounds are tight and therefore

nail down the intrinsic complexity of the algorithm classes we analyze, which

are significantly more complex than classes commonly used in learning theory.

### Benchmarking Quantum Hardware for Training of Fully Visible Boltzmann Machines

Comments: 22 pages, 13 figures, D-Wave quantum system for sampling Boltzmann machines

**Subjects**

:

Quantum Physics (quant-ph)

; Learning (cs.LG); Machine Learning (stat.ML)

Quantum annealing (QA) is a hardware-based heuristic optimization and

sampling method applicable to discrete undirected graphical models. While

similar to simulated annealing, QA relies on quantum, rather than thermal,

effects to explore complex search spaces. For many classes of problems, QA is

known to offer computational advantages over simulated annealing. Here we

report on the ability of recent QA hardware to accelerate training of fully

visible Boltzmann machines. We characterize the sampling distribution of QA

hardware, and show that in many cases, the quantum distributions differ

significantly from classical Boltzmann distributions. In spite of this

difference, training (which seeks to match data and model statistics) using

standard classical gradient updates is still effective. We investigate the use

of QA for seeding Markov chains as an alternative to contrastive divergence

(CD) and persistent contrastive divergence (PCD). Using (k=50) Gibbs steps, we

show that for problems with high-energy barriers between modes, QA-based seeds

can improve upon chains with CD and PCD initializations. For these hard

problems, QA gradient estimates are more accurate, and allow for faster

learning. Furthermore, and interestingly, even the case of raw QA samples (that

is, (k=0)) achieved similar improvements. We argue that this relates to the

fact that we are training a quantum rather than classical Boltzmann

distribution in this case. The learned parameters give rise to hardware QA

distributions closely approximating classical Boltzmann distributions that are

hard to train with CD/PCD.

### Deep Learning with Sets and Point Clouds

Comments: under review at ICLR

**Subjects**

:

Machine Learning (stat.ML)

; Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

We study a simple notion of structural invariance that readily suggests a

parameter-sharing scheme in deep neural networks. In particular, we define

structure as a collection of relations, and derive graph convolution and

recurrent neural networks as special cases. We study composition of basic

structures in defining models that are invariant to more complex “product”

structures such as graph of graphs, sets of images or sequence of sets. For

demonstration, our experimental results are focused on the setting where the

discrete structure of interest is a set. We present results on several novel

and non-trivial problems on sets, including semi-supervised learning using

clustering information, point-cloud classification and set outlier detection.

### Generate Models and Model Criticism via Optimized Maximum Mean Discrepancy

Comments: Submitted to ICLR 2017; public comments at this http URL

**Subjects**

:

Machine Learning (stat.ML)

; Artificial Intelligence (cs.AI); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Methodology (stat.ME)

We propose a method to optimize the representation and distinguishability of

samples from two probability distributions, by maximizing the estimated power

of a statistical test based on the maximum mean discrepancy (MMD). This

optimized MMD is applied to the setting of unsupervised learning by generative

adversarial networks (GAN), in which a model attempts to generate realistic

samples, and a discriminator attempts to tell these apart from data samples. In

this context, the MMD may be used in two roles: first, as a discriminator,

either directly on the samples, or on features of the samples. Second, the MMD

can be used to evaluate the performance of a generative model, by testing the

model’s samples against a reference data set. In the latter role, the optimized

MMD is particularly helpful, as it gives an interpretable indication of how the

model and data distributions differ, even in cases where individual model

samples are not easily distinguished either by eye or by classifier.

### Attending to Characters in Neural Sequence Labeling Models

Comments: Proceedings of COLING 2016

**Subjects**

:

Computation and Language (cs.CL)

; Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Sequence labeling architectures use word embeddings for capturing similarity,

but suffer when handling previously unseen or rare words. We investigate

character-level extensions to such models and propose a novel architecture for

combining alternative word representations. By using an attention mechanism,

the model is able to dynamically decide how much information to use from a

word- or character-level component. We evaluated different architectures on a

range of sequence labeling datasets, and character-level extensions were found

to improve performance on every benchmark. In addition, the proposed

attention-based architecture delivered the best results even with a smaller

number of trainable parameters.

### Riemannian Tensor Completion with Side Information

Tengfei Zhou , Hui Qian , Zebang Shen , Congfu Xu **Subjects** : Machine Learning (stat.ML) ; Learning (cs.LG); Numerical Analysis (cs.NA)

Riemannian optimization methods have shown to be both fast and accurate in

recovering a large-scale tensor from its incomplete observation. However, in

almost all recent Riemannian tensor completion methods, only low rank

constraint is considered. Another important fact, side information or features,

remains far from exploiting within the Riemannian optimization framework. In

this paper, we explicitly incorporate the side information into a Riemannian

minimization model. Specifically, a feature-embedded objective is designed to

substantially reduce the sample complexity. For such a Riemannian optimization,

a particular metric can be constructed based on the curvature of the objective,

which leads to a novel Riemannian conjugate gradient descent solver. Numerical

experiments suggest that our solver is more efficient than the state-of-the-art

when a highly accurate solution is required.

### Reinforcement Learning of Contextual MDPs using Spectral Methods

Kamyar Azizzadenesheli , Alessandro Lazaric , Animashree Anandkumar **Subjects** : Artificial Intelligence (cs.AI) ; Learning (cs.LG); Machine Learning (stat.ML)

We propose a new reinforcement learning (RL) algorithm for contextual Markov

decision processes (CMDP) using spectral methods. CMDPs are structured MDPs

where the dynamics and rewards depend on a smaller number of hidden states or

contexts. If the mapping between the hidden and observed states is known a

priori, then standard RL algorithms such as UCRL are guaranteed to attain low

regret. Is it possible to achieve regret of the same order even when the

mapping is unknown? We provide an affirmative answer in this paper. We exploit

spectral methods to learn the mapping from hidden to observed states with

guaranteed confidence bounds, and incorporate it into the UCRL-based framework

to obtain order-optimal regret.

### Annealing Gaussian into ReLU: a New Sampling Strategy for Leaky-ReLU RBM

Chun-Liang Li , Siamak Ravanbakhsh , Barnabas Poczos **Subjects** : Machine Learning (stat.ML) ; Learning (cs.LG)

Restricted Boltzmann Machine (RBM) is a bipartite graphical model that is

used as the building block in energy-based deep generative models. Due to

numerical stability and quantifiability of the likelihood, RBM is commonly used

with Bernoulli units. Here, we consider an alternative member of exponential

family RBM with leaky rectified linear units — called leaky RBM. We first

study the joint and marginal distributions of leaky RBM under different

leakiness, which provides us important insights by connecting the leaky RBM

model and truncated Gaussian distributions. The connection leads us to a simple

yet efficient method for sampling from this model, where the basic idea is to

anneal the leakiness rather than the energy; — i.e., start from a fully

Gaussian/Linear unit and gradually decrease the leakiness over iterations. This

serves as an alternative to the annealing of the temperature parameter and

enables numerical estimation of the likelihood that are more efficient and more

accurate than the commonly used annealed importance sampling (AIS). We further

demonstrate that the proposed sampling algorithm enjoys faster mixing property

than contrastive divergence algorithm, which benefits the training without any

additional computational cost.

## Information Theory

Comments: 8 pages, 7 figures. arXiv admin note: text overlap with arXiv:1502.05597

**Subjects**

:

Information Theory (cs.IT)

This paper is concerned with SC/FDE for bandwidth-efficient uplink block

transmission, with QAM schemes, in a MU MIMO system. The number of BS receiver

antennas is assumed to be large, but not necessarily much larger than the

overall number of transmitter antennas jointly using the same time/frequency

resource at MT.

In this context, we consider several detection techniques and evaluate, in

detail, the corresponding detection performances (discussed with the help of

selected performance bounds), for a range of values regarding the number of

available BS receiver antennas. From our performance results, we conclude that

simple linear detection techniques, designed to avoid the need of complex

matrix inversions, can lead to unacceptably high error floor levels. However,

by combining the use of such simple linear detectors with an appropriate

interference cancellation procedure – within an iterative DF technique -, a

close approximation to the SIMO MFB performance can be achieved after a few

iterations, even for 64-QAM schemes, when the number of BS antennas is, at

least, five times higher than the number of antennas which are jointly used at

the user terminals.

Comments: 16 pages, 10 figures

**Subjects**

:

Information Theory (cs.IT)

Generalized frequency division multiplexing (GFDM) is considered to be a

promising waveform candidate for 5G new radio. It features good properties such

as low out-of-band (OOB) radiation. One major drawback of GFDM known in the

literature is that a zero-forcing receiver suffers from the noise enhancement

effect since the GFDM (transmitter) matrix in general has a greater-than-unity

condition number. In this paper, we propose a new matrix-based characterization

of GFDM matrices, as opposed to traditional vector-based characterization with

prototype filters. The characterization is helpful in deriving properties of

GFDM matrices, including conditions on GFDM matrices being nonsingular and

unitary, respectively. Further, the condition on the existence of a form of

low-complexity MMSE receivers is also derived. It is found that such a receiver

under frequency-selective channels exists if and only if the GFDM transmitter

matrix is chosen to be unitary. For non-unitary transmitter matrices, a

low-complexity suboptimal MMSE receiver is proposed with a performance very

close to that of an MMSE receiver. Besides, optimal prototype filters in terms

of minimizing receiver mean square error (MSE) are derived and found to

correspond to the use of unitary GFDM matrices under many scenarios. These

optimal filters can be applied in GFDM systems without causing the problem of

noise enhancement, thereby having the same MSE performance as OFDM.

Furthermore, we find that GFDM matrices with a size of power of two do exist in

the class of unitary GFDM matrices. Finally, while the OOB radiation

performance of systems using a unitary GFDM matrix is not optimal in general,

it is shown that the OOB radiation can be satisfactorily low if the parameters

are carefully chosen.

### Leech Constellations of Construction-A Lattices

Comments: 30 pages, 5 figures

**Subjects**

:

Information Theory (cs.IT)

The problem of communicating over the AWGN channel with lattice codes is

addressed in this paper. Theoretically, Voronoi constellations have proven to

yield very powerful lattice codes when the fine lattice is AWGN-good and the

coarse lattice has an optimal shaping gain. However, achieving Shannon capacity

with these premises and practically implementable encoding algorithms is in

general not an easy task. In this work, constructions called Leech

constellations are proposed. They involve Construction-A lattices as fine

lattices and a direct sum of Leech lattices as shaping lattice. The combination

of these two ingredients allows to design finite lattice constellations of

dual-diagonal LDA lattices whose numerically measured waterfall region is

situated at less than 0:8 dB from Shannon capacity. A detailed explanation on

how to perform encoding and demapping is given. A very appreciable feature of

Leech constellations of dual-diagonal LDA lattices is that encoding, deocoding,

and demapping have all linear complexity in the block length.

### Interference Cancellation at Receivers in Cache-Enabled Wireless Networks

Chenchen Yang , Yao Yao , Bin Xia , Kaibin Huang , Weiliang Xie , Yong Zhao **Subjects** : Information Theory (cs.IT)

In this paper, we propose to exploit the limited cache packets as side

information to cancel incoming interference at the receiver side. We consider a

stochastic network where the random locations of base stations and users are

modeled using Poisson point processes. Caching schemes to reap both the local

caching gain and the interference cancellation gain for the users are developed

based on two factors: the density of different user subsets and the packets

cached in the corresponding subsets. The packet loss rate (PLR) is analyzed,

which depends on both the cached packets and the channel state information

(CSI) available at the receiver. Theoretical results reveal the tradeoff

between caching resource and wireless resource. The performance for different

caching schemes are analyzed and the minimum achievable PLR for the distributed

caching is derived.

### Counting generalized Reed-Solomon codes

Peter Beelen , David Glynn , Tom Høholdt , Krishna Kaipa **Subjects** : Information Theory (cs.IT) ; Discrete Mathematics (cs.DM)

In this article we count the number of generalized Reed-Solomon (GRS) codes

of dimension k and length n, including the codes coming from a non-degenerate

conic plus nucleus. We compare our results with known formulae for the number

of 3-dimensional MDS codes of length n=6,7,8,9.

### Towards Information Privacy for the Internet of Things

Comments: 13 pages, 4 figures, submitted to IEEE transaction on signal processing

**Subjects**

:

Information Theory (cs.IT)

In an Internet of Things network, multiple sensors send information to a

fusion center for it to infer a public hypothesis of interest. However, the

same sensor information may be used by the fusion center to make inferences of

a private nature that the sensors wish to protect. To model this, we adopt a

decentralized hypothesis testing framework with binary public and private

hypotheses. Each sensor makes a private observation and utilizes a local sensor

decision rule or privacy mapping to summarize that observation before sending

it to the fusion center. Without assuming knowledge of the joint distribution

of the sensor observations and hypotheses, we adopt a nonparametric learning

approach to design local privacy mappings. We introduce the concept of an

empirical normalized risk, which provides a theoretical guarantee for the

network to achieve information privacy for the private hypothesis with high

probability when the number of training samples is large. We develop iterative

optimization algorithms to determine an appropriate privacy threshold and the

best sensor privacy mappings, and show that they converge. Numerical results on

both synthetic and real data sets suggest that our proposed approach yields low

error rates for inferring the public hypothesis, but high error rates for

detecting the private hypothesis.

### An algebraic framework for end-to-end physical-layer network coding

Elisa Gorla , Alberto Ravagnani **Subjects** : Information Theory (cs.IT) ; Commutative Algebra (math.AC)

We propose an algebraic setup for end-to-end physical-layer network coding

based on submodule transmission. We introduce a distance function between

modules, describe how it relates to information loss and errors, and show how

to compute it. Then we propose a definition of submodule error-correcting code,

and investigate bounds and constructions for such codes.

### BDMA for Millimeter-Wave/Terahertz Massive MIMO Transmission with Per-Beam Synchronization

Comments: 29 pages, 4 figures

**Subjects**

:

Information Theory (cs.IT)

We propose beam division multiple access (BDMA) with per-beam synchronization

(PBS) in time and frequency for wideband massive multiple-input multiple-output

(MIMO) transmission over millimeter-wave (mmW)/Terahertz (THz) bands. We first

introduce a physically motivated beam domain channel model for massive MIMO and

demonstrate that the envelopes of the beam domain channel elements tend to be

independent of time and frequency when both the numbers of antennas at base

station and user terminals (UTs) tend to infinity. Motivated by the derived

beam domain channel properties, we then propose PBS for mmW/THz massive MIMO.

We show that both the effective delay and Doppler frequency spreads of wideband

massive MIMO channels with PBS are reduced by a factor of the number of UT

antennas compared with the conventional synchronization approaches.

Subsequently, we apply PBS to BDMA, investigate beam scheduling to maximize the

achievable ergodic rates for both uplink and downlink BDMA, and develop a

greedy beam scheduling algorithm. Simulation results verify the effectiveness

of BDMA with PBS for mmW/THz wideband massive MIMO systems in typical mobility

scenarios.

### Optimal Placement Delivery Arrays

Comments: 11pages, coded caching, placement delivery array

**Subjects**

:

Information Theory (cs.IT)

In wireless networks, coded caching scheme is an effective technique to

reduce network congestion during peak traffic times. A ((K,F,Z,S)) placement

delivery array (((K,F,Z,S))PDA in short) can be used to design a coded caching

scheme with the delivery rate (S/F) during the peak traffic times. Clearly in

order to minimize delivery rate, we only need to consider a ((K,F,Z,S))PDA with

minimum (S). For the fixed parameters (K), (F) and (Z), a ((K,F,Z,S))PDA with

the minimum (S) is called optimal. So it is meaningful to study optimal PDAs.

There are some known PDAs constructed by combinatorial design theory,

hypergraphs and so on. However there is only one class of optimal PDAs (IEEE

Trans. Inf. Theory 60:2856-2867, 2014) constructed by Ali and Niesen. We will

focus on constructing optimal PDAs. In this paper, two classes of lower bounds

on the value of (S) in a ((K,F,Z,S))PDA are derived first. Then two classes of

recursive constructions are proposed. Consequently (i) optimal PDAs with (Z=1)

and (F-1) for any positive integers (K) and (F) are obtained; (ii) several

infinite classes of optimal PDAs for some (K), (F) and (Z) are constructed.

Finally more existence of optimal PDAs with (Z=F-2) are given.

### Millimeter Wave Beam Alignment: Large Deviations Analysis and Design Insights

Comments: 27 pages, 7 figures, submitted

**Subjects**

:

Information Theory (cs.IT)

In millimeter wave cellular communication, fast and reliable beam alignment

via beam training is crucial to harvest sufficient beamforming gain for the

subsequent data transmission. In this paper, we establish fundamental limits in

beam-alignment performance under both the exhaustive search and the

hierarchical search that adopts multi-resolution beamforming codebooks,

accounting for time-domain training overhead. Specifically, we derive lower and

upper bounds on the probability of misalignment for an arbitrary level in the

hierarchical search, based on a single-path channel model. Using the method of

large deviations, we characterize the decay rate functions of both bounds and

show that the bounds coincide as the training sequence length goes large. With

these results, we characterize the asymptotic misalignment probability of both

the hierarchical and exhaustive search, and show that the latter asymptotically

outperforms the former, subject to the same training overhead and codebook

resolution. We show via numerical results that this relative performance

behavior holds in the non-asymptotic regime. Moreover, the exhaustive search is

shown to achieve significantly higher worst-case spectrum efficiency than the

hierarchical search, when the pre-beamforming signal-to-noise ratio (SNR) is

relatively low. This study hence implies that the exhaustive search is more

effective for users situated not so close to base stations, as they tend to

have low SNR.

### Fast Polarization and Finite-Length Scaling for Non-Stationary Channels

Hessam Mahdavifar **Subjects** : Information Theory (cs.IT)

We consider the problem of polar coding for transmission over a

non-stationary sequence of independent binary-input memoryless symmetric (BMS)

channels (left{W_i

ight}_{i=1}^{infty}), where the (i)-th encoded bit is

transmitted over (W_i). We show, for the first time, a polar coding scheme that

achieves the effective average symmetric capacity ()

overline{I}(left{W_i

ight}_{i=1}^{infty}) := lim_{N

ightarrow infty}

frac{1}{N}sum_{i=1}^N I(W_i), () assuming that the limit exists. The polar

coding scheme is constructed using Ar{i}kan’s channel polarization

transformation in combination with certain permutations at each polarization

level and certain skipped operations. This guarantees a fast polarization

process that results in polar coding schemes with block lengths upper bounded

by a polynomial of (1/epsilon), where (epsilon) is the gap to the average

capacity. More specifically, given an arbitrary sequence of BMS channels

(left{W_i

ight}_{i=1}^{N}) and (P_e), where (0 < P_e <1), we construct a

polar code of length (N) and rate (R) guaranteeing a block error probability of

at most (P_e) for transmission over (left{W_i

ight}_{i=1}^{N}) such that ()

N leq frac{kappa}{(overline{I}_N – R)^{mu}} () where (mu) is a constant,

(kappa) is a constant depending on (P_e) and (mu), and (overline{I}_N) is

the average of the symmetric capacities (I(W_i)), for (i=1,2,,dots,N). We

further show a numerical upper bound on (mu) that is: (mu leq 10.78). The

encoding and decoding complexities of the constructed polar code preserves (O(N

log N)) complexity of Ar{i}kan’s polar codes.

### Self-Calibration via Linear Least Squares

Shuyang Ling , Thomas Strohmer **Subjects** : Information Theory (cs.IT)

Whenever we use devices to take measurements, calibration is indispensable.

While the purpose of calibration is to reduce bias and uncertainty in the

measurements, it can be quite difficult, expensive and sometimes even

impossible to implement. We study a challenging problem called

self-calibration, i.e., the task of designing an algorithm for devices so that

the algorithm is able to perform calibration automatically. More precisely, we

consider the setup (oldsymbol{y} = mathcal{A}(oldsymbol{d}) oldsymbol{x}

+ oldsymbol{epsilon}) where only partial information about the sensing

matrix (mathcal{A}(oldsymbol{d})) is known and where

(mathcal{A}(oldsymbol{d})) linearly depends on (oldsymbol{d}). The goal is

to estimate the calibration parameter (oldsymbol{d}) (resolve the uncertainty

in the sensing process) and the signal/object of interests (oldsymbol{x})

simultaneously. For three different models of practical relevance we show how

such a bilinear inverse problem, including blind deconvolution as an important

example, can be solved via a simple linear least squares approach. As a

consequence, the proposed algorithms are numerically extremely efficient, thus

allowing for real-time deployment. Explicit theoretical guarantees and

stability theory are derived and the number of sampling complexity is nearly

optimal (up to a poly-log factor). Applications in imaging sciences and signal

processing are discussed and numerical simulations are presented to demonstrate

the effectiveness and efficiency of our approach.

### Energy-Based Adaptive Multiple Access in LoRaWAN IoT Systems with Energy Harvesting

Comments: Submitted to IEEE ICASSP 2017

**Subjects**

:

Information Theory (cs.IT)

This paper considers a network of energy harvesting nodes which perform data

communication to a sink node over a multiple access channel. In order to reduce

the complexity of network control resulting from adaptation to the energy

storage level at each node, an optimization framework is proposed where energy

storage dynamics are replaced by dynamic average power constraints induced by

the time correlated energy supply, thus enabling lightweight and flexible

network control. The structure of the throughput-optimal genie-aided

transmission policy is derived under constraints on the average energy

consumption. The policy takes the form of the packet transmission probability

defined on the energy harvesting state and number of active wireless nodes.

Inspired by the genie-aided policy, a Bayesian estimation approach is proposed

to address the case where the base station estimates the number of active

wireless nodes based on the observed network transmission pattern, and

broadcasts low-overhead control information to the whole network.

### Resource Allocation in Wireless Powered Relay Networks: A Bargaining Game Approach

Comments: 14 pages, 7 figures, journal paper

**Subjects**

:

Information Theory (cs.IT)

; Computer Science and Game Theory (cs.GT)

Simultaneously information and power transfer in mobile relay networks have

recently emerged, where the relay can harvest the radio frequency (RF) energy

and then use this energy for data forwarding and system operation. Most of the

previous works do not consider that the relay may have its own objectives, such

as using the harvested energy for its own transmission instead of maximizing

transmission of the network. Therefore, in this paper, we propose a Nash

bargaining approach to balance the information transmission efficiency of

source-destination pairs and the harvested energy of the relay in a wireless

powered relay network with multiple source-destination pairs and one relay. We

analyze and prove that the Nash bargaining problem has several desirable

properties such as the discreteness and quasi-concavity, when it is decomposed

into three sub-problems: the energy transmission power optimization, the power

control for data transmission and the time division between energy transmission

and data transmission. Based on the theoretical analysis, we propose an

alternating power control and time division algorithm to find a suboptimal

solution. Simulation results clearly show and demonstrate the properties of the

problem and the convergence of our algorithm.

### WINDOW: Wideband Demodulator for Optical Waveforms

Omri Lev , Tal Wiener , Deborah Cohen , Yonina C. Eldar **Subjects** : Information Theory (cs.IT)

Optical communication systems, which operate at very high rates, are often

limited by the sampling rate bottleneck. The optical wideband regime may exceed

analog to digital converters (ADCs) front-end bandwidth. Multi-channel sampling

approaches, such as multicoset or interleaved ADCs, have been proposed to

sample the wideband signal using several channels. Each channel samples below

the Nyquist rate such that the overall sampling rate is preserved. However,

this scheme suffers from two practical limitations that make its implementation

difficult. First, the inherent anti-aliasing filter of the samplers distorts

the wideband signal. Second, it requires accurate time shifts on the order of

the signal’s Nyquist rate, which are challenging to maintain. In this work, we

propose an alternative multi-channel sampling scheme, the wideband demodulator

for optical waveforms (WINDOW), based on analog RF demodulation, where each

channel aliases the spectrum using a periodic mixing function before

integration and sampling. We show that intentionally using the inherent ADC

filter to perform integration increases the signal to noise ratio (SNR). We

demonstrate both theoretically and through numerical experiments that our

system outperforms multicoset in terms of signal recovery and symbol estimation

in the presence of both thermal and quantization noise but is slightly less

robust to timing jitter.

### The Entropy Region is not Closed Under Duality

Comments: submitted

**Subjects**

:

Information Theory (cs.IT)

We import a duality notion coming from polymatroids to define duality for

information inequalities. We show that the entropy region for (nge 5) is not

closed under duality. Our result answers an open question of Mat`uv{s}

(1992).

Paul Hand , Vladislav Voroninski **Subjects** : Information Theory (cs.IT) ; Optimization and Control (math.OC)

The phase retrieval problem has garnered significant attention since the

development of the PhaseLift algorithm, which is a convex program that operates

in a lifted space of matrices. Because of the substantial computational cost

due to lifting, many approaches to phase retrieval have been developed,

including non-convex optimization algorithms which operate in the natural

parameter space, such as Wirtinger Flow. Very recently, a convex formulation

called PhaseMax has been discovered, and it has been proven to achieve phase

retrieval via linear programming in the natural parameter space under optimal

sample complexity. The current proofs of PhaseMax rely on statistical learning

theory or geometric probability theory. Here, we present a short and elementary

proof that PhaseMax exactly recovers real-valued vectors from random

measurements under optimal sample complexity. Our proof only relies on standard

probabilistic concentration and covering arguments, yielding a simpler and more

direct proof than those that require statistical learning theory, geometric

probability or the highly technical arguments for Wirtinger Flow-like

approaches.

### Some conjectures on codes

Clelia De Felice **Subjects** : Formal Languages and Automata Theory (cs.FL) ; Information Theory (cs.IT)

Variable-length codes are the bases of the free submonoids of a free monoid.

There are some important longstanding open questions about the structure of

finite maximal codes. In this paper we discuss this conjectures and their

relations with factorizations of cyclic groups.

### Good Integers and Applications in Coding Theory

Comments: 21 pages

**Subjects**

:

Rings and Algebras (math.RA)

; Information Theory (cs.IT); Number Theory (math.NT)

A class of good integers has been introduced by P. Moree in (1997) together

with the characterization of good odd integers. Such integers have shown to

have nice number theoretical properties and wide applications. In this paper, a

complete characterization of all good integers is given.

Two subclasses of good integers are introduced, namely, oddly-good and

evenly-good integers. The characterization and properties of good integers in

the two subclasses are determined.

As applications, good integers and oddly-good integers are applied in the

study of the hulls of abelian codes. The average dimension of the hulls of

abelian codes is given together with some upper and lower bounds.

### Nuclei and automorphism groups of generalized twisted Gabidulin codes

Comments: 20 pages

**Subjects**

:

Combinatorics (math.CO)

; Information Theory (cs.IT)

Generalized twisted Gabidulin codes are one of the few known families of

maximum rank matrix codes over finite fields. As a subset of m by n matrices,

when m=n, the automorphism group of any generalized twisted Gabidulin code has

been completely determined by the authors recently. In this paper, we consider

the same problem for m<n. Under certain conditions on their parameters, we

determine their middle nuclei and right nuclei, which are important invariants

with respect to the equivalence for rank metric codes. Furthermore, we also use

them to derive necessary conditions on the automorphisms of generalized twisted

Gabidulin codes.

### Bounds and Constructions for (overline{3})-Strongly Separable Codes with Length (3)

Comments: 17 pages, separable codes, strongly separable codes

**Subjects**

:

Discrete Mathematics (cs.DM)

; Information Theory (cs.IT)

As separable code (SC, IEEE Trans Inf Theory 57:4843-4851, 2011) and

frameproof code (FPC, IEEE Trans Inf Theory 44:1897-1905, 1998) do in

multimedia fingerprinting, strongly separable code (SSC, Des. Codes and

Cryptogr.79:303-318, 2016) can be also used to construct anti-collusion codes.

Furthermore, SSC is better than FPC and SC in the applications for multimedia

fingerprinting since SSC has lower tracing complexity than that of SC (the same

complexity as FPC) and weaker structure than that of FPC. In this paper, we

first derive several upper bounds on the number of codewords of

(overline{t})-SSC. Then we focus on (overline{3})-SSC with codeword length

(3), and obtain the following two main results: (1) An equivalence between an

SSC and an SC. %Consequently a more tighter upper bound ((3q^2/4)) and lower

bound ((q^{3/2})) on the number of codewords are obtained; (2) An improved

lower bound (Omega (q^{5/3}+q^{4/3}-q)) on the size of a (q)-ary SSC when

(q=q_1^6) for any prime power (q_1equiv 1 pmod 6), better than the before

known bound (lfloorsqrt{q}

floor^{3}), which is obtained by means of

difference matrix and the known result on the subset of (mathbb{F}^{n}_q)

containing no three points on a line.

### A linear-time benchmarking tool for generalized surface codes

Comments: Software available online: this http URL

**Subjects**

:

Quantum Physics (quant-ph)

; Information Theory (cs.IT)

Quantum information processors need to be protected against errors and

faults. One of the most widely considered fault-tolerant architecture is based

on surface codes. While the general principles of these codes are well

understood and basic code properties such as minimum distance and rate are easy

to characterize, a code’s average performance depends on the detailed geometric

layout of the qubits. To date, optimizing a surface code architecture and

comparing different geometric layouts relies on costly numerical simulations.

Here, we propose a benchmarking algorithm for simulating the performance of

surface codes, and generalizations thereof, that runs in linear time. We

implemented this algorithm in a software that generates performance reports and

allows to quickly compare different architectures.

Comments: 30 pages, 6 figures, submit to Tran. Wireless Commun

**Subjects**

:

Networking and Internet Architecture (cs.NI)

; Information Theory (cs.IT)

In this paper, we study the resource allocation problem for a single-cell

non-orthogonal multiple access (NOMA) relay network where an OFDM

amplify-and-forward (AF) relay allocates the spectrum and power resources to

the source-destination (SD) pairs. We aim to optimize the resource allocation

to maximize the average sum-rate. The optimal approach requires an exhaustive

search, leading to an NP-hard problem. To solve this problem, we propose two

efficient many-to-many two-sided SD pair-subchannel matching algorithms in

which the SD pairs and sub-channels are considered as two sets of players

chasing their own interests. The proposed algorithms can provide a sub-optimal

solution to this resource allocation problem in affordable time. Both the

static matching algorithm and dynamic matching algorithm converge to a

pair-wise stable matching after a limited number of iterations. Simulation

results show that the capacity of both proposed algorithms in the NOMA scheme

significantly outperforms the conventional orthogonal multiple access scheme.

The proposed matching algorithms in NOMA scheme also achieve a better

user-fairness performance than the conventional orthogonal multiple access.

Comments: 5 pages, 2 figures; comments are welcome

**Subjects**

:

Quantum Physics (quant-ph)

; Information Theory (cs.IT)

A novel attack on the coherent quantum key distribution (coherent one-way,

COW) protocol is considered. The key idea of the proposed attack is performing

of individual measurements on a fraction of intercepted states and transferring

or blocking of the remaining part depending on a measurement result.

### Entropic Causal Inference

Comments: To appear in AAAI 2017

**Subjects**

:

Artificial Intelligence (cs.AI)

; Information Theory (cs.IT); Machine Learning (stat.ML)

We consider the problem of identifying the causal direction between two

discrete random variables using observational data. Unlike previous work, we

keep the most general functional model but make an assumption on the unobserved

exogenous variable: Inspired by Occam’s razor, we assume that the exogenous

variable is simple in the true causal direction. We quantify simplicity using

R’enyi entropy. Our main result is that, under natural assumptions, if the

exogenous variable has low (H_0) entropy (cardinality) in the true direction,

it must have high (H_0) entropy in the wrong direction. We establish several

algorithmic hardness results about estimating the minimum entropy exogenous

variable. We show that the problem of finding the exogenous variable with

minimum entropy is equivalent to the problem of finding minimum joint entropy

given (n) marginal distributions, also known as minimum entropy coupling

problem. We propose an efficient greedy algorithm for the minimum entropy

coupling problem, that for (n=2) provably finds a local optimum. This gives a

greedy algorithm for finding the exogenous variable with minimum (H_1) (Shannon

Entropy). Our greedy entropy-based causal inference algorithm has similar

performance to the state of the art additive noise models in real datasets. One

advantage of our approach is that we make no use of the values of random

variables but only their distributions. Our method can therefore be used for

causal inference for both ordinal and also categorical data, unlike additive

noise models.

欢迎加入我爱机器学习QQ6群： **337537549**

微信扫一扫，关注我爱机器学习公众号

微博：我爱机器学习

原文链接：arXiv Paper Daily: Tue, 15 Nov 2016，转载请注明来源！