# Lectures

##### Maximum Entropy and Information Constraints in Large Networks and Power-law Graphs

Many large networks occurring naturally (e.g. the phone calls graph, the internet domains and routers, the World Wide Web, metabolic and protein networks) are characterized by the power-law degree sequence $N(k)\propto k^{-\beta}$, such that the number $N(k)$ of vertices (nodes) with degree $k$ is inversely proportional to the $\beta$-th power of that degree, with exponent parameter $\beta>0$ (i.e. nodes with small degree occur more frequently). It is also known that such graphs can be generated by the preferential attachment procedure, in which new nodes are more likely to be connected to nodes with high degree. I will show how the power-law degree sequence can be obtained as a solution to the maximum entropy problem with a constraint on the expectation of the logarithm of the degree. Then I will show that the preferential attachment procedure can be obtained as a solution to the dual problem of minimizing Shannon’s mutual information between nodes subject to a constraint on the expected path length in the graph. This will allow us to introduce two important variational problems of information theory and show duality between them. In addition, this information-theoretic approach will allow us to give a new interpretation of some results about power-law graphs. In particular, we shall discuss conditions for connectedness and existence of a giant component in large power-law graphs and the corresponding phase transition. We shall also derive a new formula for estimating the exponent parameter $\beta$, which is the Lagrange multiplier related to the constraint in the maximum entropy problem.
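
The maximum entropy step mentioned above can be sketched with the standard Lagrange-multiplier argument. This is a textbook-style outline and not necessarily the derivation given in the lecture. It assumes degrees $k = 1, 2, \dots$ with no upper bound, a normalized degree distribution $P(k)$, and a prescribed value $\lambda$ for the expected log-degree; the symbols $P$, $\alpha$ and $\lambda$ are introduced here for illustration only.

$$
\max_{P}\; -\sum_{k\ge 1} P(k)\ln P(k)
\quad\text{subject to}\quad
\sum_{k\ge 1} P(k) = 1,
\qquad
\sum_{k\ge 1} P(k)\ln k = \lambda .
$$

With multipliers $\alpha$ and $\beta$ for the two constraints, setting the derivative of the Lagrangian with respect to $P(k)$ to zero gives

$$
-\ln P(k) - 1 - \alpha - \beta \ln k = 0
\quad\Longrightarrow\quad
P(k) = \frac{k^{-\beta}}{\zeta(\beta)},
\qquad
\zeta(\beta) = \sum_{k\ge 1} k^{-\beta},
$$

which is normalizable under these assumptions only for $\beta > 1$. The multiplier $\beta$ is then fixed by the constraint itself, $\lambda = -\zeta'(\beta)/\zeta(\beta)$, which is one way to see why the exponent can be estimated from the observed mean of $\ln k$.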

##### Value of Information and Geometry of Optimal Decision-making and Learning

The mathematical theory of optimal decisions under uncertainty is based on the idea of maximizing the expected utility functional over a set of lotteries. This view appears natural to mathematicians, who consider these lotteries as probability measures on some common algebra of events; the linear structure of the space of measures leads to the result of von Neumann and Morgenstern, celebrated in game theory, on the existence of a linear or affine objective functional, the expected utility. Behavioural economists and psychologists, on the other hand, have demonstrated that people consistently violate the axioms of expected utility, and this includes professional risk-takers such as stockbrokers. In this talk I will show how these paradoxes can be explained if, apart from utility, one also considers the value of information that a decision-maker receives. This will be a good opportunity to introduce the value of information theory, which was developed by Rouslan Stratonovich in the 1960s as an amalgamation of theories of optimal decision-making and information. I will also outline a new geometric approach to the value of information, which will allow us to see that some properties of the optimal value function are independent of a specific definition of information. We shall extend this approach to a dynamical system, in which decisions with information constraints are made sequentially, and define a frontier corresponding to an optimally learning system. I will show how this theory gives new insights into parameter control of learning and optimization algorithms.

##### Deep Learning and AI

There has been much progress in AI thanks to advances in deep learning in recent years, especially in areas such as computer vision, speech recognition, natural language processing, playing games, robotics, machine translation, etc. This lecture aims at explaining some of the core concepts and motivations behind deep learning and representation learning. Deep learning builds on many of the ideas introduced decades earlier with the connectionist approach to machine learning, inspired by the brain. These essential early contributions include the notion of distributed representation and the back-propagation algorithm for training multi-layer neural networks, but also the architecture of recurrent neural networks and convolutional neural networks. In addition to the substantial increase in computing power and dataset sizes, many modern additions have contributed to the recent successes. These include techniques making it possible to train networks with more layers (hence the name deep learning), which can generalize better, as well as a better theoretical understanding of the success of deep learning, both from an optimization point of view and from a generalization point of view. Two other areas of major progress have been in unsupervised learning, in particular the ability of neural networks to stochastically generate high-dimensional samples (like images) from a possibly conditional distribution, as well as the combination of reinforcement learning and deep learning techniques.

##### Deep Learning, Recurrent Nets and Attention for System 2 Processing

Much of the progress in deep learning has been in supervised learning and perception, making it possible to provide a form of intuitive understanding to computers, and more generally to achieve good performance in System 1 cognitive tasks (unconscious, fast, intuitive). Much remains to be done to handle System 2 processing (conscious, slow, sequential, linguistic, explicit). The tools we already have are based on recurrent neural networks, which will be described with their own issues, and attention mechanisms. Together, these advances have moved neural nets from pattern recognition devices working on vectors to general-purpose differentiable modular machines which can handle arbitrary data structures. The lecture will close with a discussion of open questions to handle more of System 2 capabilities, such as the ability to reason, to anchor language in an intuitive world model, to model attentive consciousness, to capture causal explanations at multiple time scales, and more generally to exploit action to build better models of the world.

##### Unsupervised Representation Learning and Generative Adversarial Networks

One of the central questions for deep learning is how a learning agent could discover good representations in an unsupervised way. First, we consider the still open question of what constitutes a good representation, with the notion of disentangling the underlying factors of variation. We view representation learning from a geometric perspective as a transformation of the data space which changes the shape of the data manifold in order to flatten it, thus separating the sources of variation from each other. Second, we discuss issues with the maximum likelihood framework which has been behind our early work on Boltzmann machines as well as our work on auto-regressive and recurrent neural networks as generative models. These issues motivated our initial development of Generative Adversarial Networks, a research area which has greatly expanded recently. We discuss how adversarial training can be used to obtain invariances to some factors in the representation, and a way to make training with such an adversarial objective more stable by pushing the discriminator score towards the classification boundary but not past it. Finally, we discuss applications of GAN ideas to estimate, minimize or maximize mutual information, entropy or independence between random variables.

##### Network-based Data Analysis

Many real-life complex systems can be conveniently represented using networks, with the nodes corresponding to the system’s components and the arcs describing their pairwise interactions. Analyzing structural properties of a network model provides useful insights into the underlying system’s behavior. This talk introduces the basics of the network-based approach to analysis of large data sets and discusses several applications of this methodology.

##### Cluster-detection Methods in Network-based Data Analysis

Cluster analysis is an important task arising in network-based data analysis. Perhaps the most natural model of a cluster in a network is given by a clique, which is a subset of pairwise-adjacent nodes. However, the clique model appears to be overly restrictive in practice, which has led to the introduction of numerous models relaxing various properties of cliques, known as clique relaxations. This talk focuses on a systematic cluster analysis framework based on clique relaxation models.
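
To make the idea of a clique relaxation concrete, here is a minimal sketch that checks the clique definition above and one common relaxation, the s-plex, in which every node of the subset may miss up to s of the subset's members (itself included), so a 1-plex is a clique. The function names and the toy graph are hypothetical and chosen only for illustration; the talk's framework covers many other relaxations.

```python
from itertools import combinations


def is_clique(adj, nodes):
    """True if every pair of distinct nodes in `nodes` is adjacent."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))


def is_s_plex(adj, nodes, s):
    """True if each node has at least len(nodes) - s neighbours inside `nodes`."""
    members = set(nodes)
    return all(len(adj[u] & members) >= len(members) - s for u in members)


# Hypothetical toy graph: the 4-cycle a-b-c-d plus the chord a-c.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

print(is_clique(adj, ["a", "b", "c"]))          # the triangle a-b-c
print(is_clique(adj, ["a", "b", "c", "d"]))     # edge b-d is missing
print(is_s_plex(adj, ["a", "b", "c", "d"], 2))  # each node misses at most one other
```

In this toy graph the four nodes do not form a clique, yet they form a 2-plex, which is the kind of "almost clique" cluster that the strict clique model would reject.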

##### Continuous Approaches to Cluster-Detection Problems in Networks

We discuss continuous formulations for several cluster-detection problems in networks, including the maximum edge weight clique, the maximum s-plex, and the maximum independent union of cliques problems. More specifically, the problems of interest are formulated as quadratic, cubic, or higher-degree polynomial optimization problems subject to linear (typically, unit hypercube) constraints. The proposed formulations are used to develop analytical bounds as well as effective algorithms for some of the problems.
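
As an illustration of what a polynomial formulation over the unit hypercube can look like, the sketch below uses a classical quadratic formulation of the unweighted maximum clique problem: maximize $f(x) = \sum_i x_i - \sum_{\{i,j\}\notin E} x_i x_j$ over $x \in [0,1]^n$. This is a well-known formulation swapped in for illustration and is not claimed to be one of the formulations proposed in the talk. Because $f$ is linear in each variable separately, its maximum over the hypercube is attained at a 0/1 corner, so the code only enumerates corners and compares the result with a brute-force clique number. The function names and the toy graph are hypothetical.

```python
from itertools import combinations, product


def objective(x, n, edges):
    """f(x) = sum_i x_i - sum over non-adjacent pairs {i, j} of x_i * x_j."""
    penalty = sum(x[i] * x[j] for i, j in combinations(range(n), 2)
                  if (i, j) not in edges)
    return sum(x) - penalty


def max_over_corners(n, edges):
    """Maximum of f over the corners {0, 1}^n of the unit hypercube."""
    return max(objective(x, n, edges) for x in product((0, 1), repeat=n))


def clique_number(n, edges):
    """Size of the largest clique, found by exhaustive enumeration."""
    return max(len(s) for r in range(n + 1)
               for s in combinations(range(n), r)
               if all(pair in edges for pair in combinations(s, 2)))


# Hypothetical toy graph: triangle 0-1-2 with a pendant path 2-3-4.
# Edges are stored as (i, j) with i < j.
n = 5
edges = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)}

print(max_over_corners(n, edges), clique_number(n, edges))
```

Exhaustive enumeration is only feasible for tiny graphs; the point of a continuous formulation is that bounds and local-search algorithms can work with $f$ directly instead.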

##### The Principle of Least Cognitive Action: The Case of Visual Features – Part I

In this talk we introduce the principle of Least Cognitive Action with the purpose of understanding perceptual learning processes in a framework of laws of nature that closely parallels related approaches in physics. Neural networks are regarded as systems whose connections are Lagrangian variables, namely functions depending on time. They are used to minimize the cognitive action, an appropriate functional index composed of a potential term and a kinetic term, which is shown to fit very well the classic machine learning criteria based on regularization.
The theory is applied to the construction of an unsupervised learning scheme for visual features in deep convolutional neural networks, where an appropriate Lagrangian term is used to enforce a solution where the features are developed under motion invariance. The causal optimization of the cognitive action yields a solution where learning is carried out by a suitable blurring of the video, along with the interleaving of segments of null signal. Interestingly, this also sheds light on the video blurring process in newborns, as well as on the benefit from eye blinking and day-night rhythm.

##### Learning with Constraints – Part I

Learning and inference are traditionally regarded as two opposite, yet complementary and puzzling, components of intelligence. In this talk we point out that constraint-based modeling of the agent's interactions with the environment makes it possible to unify learning and inference within the same mathematical framework. The unification is based on the abstract notion of constraint, which provides a representation of knowledge granules gained from the interaction with the environment. The agents are based on a deep neural network architecture, and their learning and inferential processes are driven by different schemes for enforcing the environmental constraints. Logic constraints are also included thanks to their translation into real-valued functions that arises from the adoption of suitable t-norms.
The basic ideas are presented through simple case studies ranging from learning and inference in social nets and missing data to the checking of logic constraints and pattern generation. The theory offers a natural bridge between the formalization of knowledge and the inductive acquisition of concepts from data.
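
The translation of logic constraints into real-valued functions can be illustrated with two standard t-norms and their residual implications. The sketch below is a generic fuzzy-logic example under stated assumptions, not the specific scheme used in the talk: the rule, the variable names and the truth degrees are hypothetical, and the penalty is simply one minus the truth degree of the rule.

```python
def lukasiewicz_and(a, b):
    """Lukasiewicz t-norm: fuzzy conjunction of truth degrees in [0, 1]."""
    return max(0.0, a + b - 1.0)


def lukasiewicz_implies(a, b):
    """Residuum of the Lukasiewicz t-norm: fuzzy implication a -> b."""
    return min(1.0, 1.0 - a + b)


def product_and(a, b):
    """Product t-norm: fuzzy conjunction as ordinary multiplication."""
    return a * b


def product_implies(a, b):
    """Residuum of the product t-norm (Goguen implication)."""
    return 1.0 if a <= b else b / a


def constraint_penalty(implies, premise, conclusion):
    """Penalty in [0, 1]; zero when the rule premise -> conclusion fully holds."""
    return 1.0 - implies(premise, conclusion)


# Hypothetical network outputs for one input pattern: degrees of truth of
# the predicates in the rule "cat(x) -> animal(x)".
p_cat, p_animal = 0.9, 0.4

print(constraint_penalty(lukasiewicz_implies, p_cat, p_animal))
print(constraint_penalty(product_implies, p_cat, p_animal))
```

A penalty of this kind is a real-valued function of the network outputs, so it can be added to a training loss; the choice of t-norm changes the shape of the penalty but not the set of outputs that satisfy the rule exactly.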

##### The Principle of Least Cognitive Action: The Case of Visual Features – Part II

In this talk we introduce the principle of Least Cognitive Action with the purpose of understanding perceptual learning processes in a framework of laws of nature that closely parallels related approaches in physics. Neural networks are regarded as systems whose connections are Lagrangian variables, namely functions depending on time. They are used to minimize the cognitive action, an appropriate functional index composed of a potential term and a kinetic term, which is shown to fit very well the classic machine learning criteria based on regularization.
The theory is applied to the construction of an unsupervised learning scheme for visual features in deep convolutional neural networks, where an appropriate Lagrangian term is used to enforce a solution where the features are developed under motion invariance. The causal optimization of the cognitive action yields a solution where learning is carried out by a suitable blurring of the video, along with the interleaving of segments of null signal. Interestingly, this also sheds light on the video blurring process in newborns, as well as on the benefit from eye blinking and day-night rhythm.

##### Learning with Constraints – Part II

Learning and inference are traditionally regarded as two opposite, yet complementary and puzzling, components of intelligence. In this talk we point out that constraint-based modeling of the agent's interactions with the environment makes it possible to unify learning and inference within the same mathematical framework. The unification is based on the abstract notion of constraint, which provides a representation of knowledge granules gained from the interaction with the environment. The agents are based on a deep neural network architecture, and their learning and inferential processes are driven by different schemes for enforcing the environmental constraints. Logic constraints are also included thanks to their translation into real-valued functions that arises from the adoption of suitable t-norms.
The basic ideas are presented through simple case studies ranging from learning and inference in social nets and missing data to the checking of logic constraints and pattern generation. The theory offers a natural bridge between the formalization of knowledge and the inductive acquisition of concepts from data.

##### Machine Learning for Education and Education for Machine Learning

We now know how to train a program to play world-champion-level chess or Go, just by simulating moves into the future and reinforcing the best ones. But we can’t train a human student to learn in the same way. This talk examines what we do know about applying data to education, and what might be coming in the future.

##### Software Engineering for Machine Learning and Machine Learning for Software Engineering

The software industry has built up a formidable set of tools for software development over the last half century. But we are just starting to understand what tools are needed to build software with machine learning, not hand-coding. Some of those tools will themselves make use of machine learning.

##### Networks: History, The Present, and Future Challenges

The network revolution began in the 18th century when Euler solved the famous Königsberg bridge problem. In the 19th century, Kirchhoff initiated the theory of electrical networks and was the first person who defined the flow conservation equations, one of the milestones of network flow theory. After the invention of the telephone by Alexander Graham Bell in the 19th century, the resulting applications stimulated network analysis. The field evolved dramatically after the 19th century. At present, our lives are both affected by and interface with the networks that connect us.
After a brief historical overview, the talk will focus on recent exciting developments and discuss future challenges not only in the science of networks but also in our network-driven and changing society.

##### On the Limits of Computation in Non-convex Optimization

Large scale problems in engineering, in the design of networks and energy systems, the biomedical fields, and finance are modeled as optimization problems. Humans and nature are constantly optimizing to minimize costs or maximize profits, to maximize the flow in a network, or to minimize the probability of a blackout in a smart grid.
Due to new algorithmic developments and the computational power of machines (digital, analog, biochemical, quantum computers, etc.), optimization algorithms have been used to “solve” problems in a wide spectrum of applications in science and engineering.
But what do we mean by “solving” an optimization problem? What are the limits of what machines (and humans) can compute?

##### Networks in Finance and Economics

Financial markets, banks, currency exchanges and other institutions can be modeled and analyzed as network structures whose nodes are agents such as companies, shareholders, currencies, or countries. The edges (which can be weighted, directed, etc.) represent any type of relation between agents, for example ownership, friendship, collaboration, influence, dependence, or correlation. We are going to discuss network and data science techniques for studying the dynamics of financial markets and other problems in economics.

##### Social Physics: Learning and Predicting Human Behavior

Human behavior is as much a function of social influence as of personal thought. Consequently, observations of interactions (including simple observation of others) and of the behaviors of others can accurately predict the future behavior of individuals. I will review the evidence for these effects, and show how to construct predictive models from observations. I will then review the use of various data sources and methods of analysis in terms of their usefulness for human behavior prediction.

##### Privacy-Preserving and Distributed Machine Learning

The availability of data from phones, credit cards, automobiles and other modern technologies places personal privacy at risk. Regulations such as GDPR are improving the situation; however, many worry that these new restrictions will prevent improvements in health, decision making, and other civil systems. Another worry is that these data are held by different institutions, so that it is difficult to achieve a holistic view of the situation. I will explain my work that led to GDPR, and methods of handling and analyzing data that protect both privacy and proprietary data and yet provide state-of-the-art insights.

##### Unsupervised Machine Translation – Part I

Machine Translation (MT) is a flagship of the recent successes and advances in the field of natural language processing. Its practical applications and use as a testbed for sequence transduction algorithms have spurred renewed interest in this topic. While MT systems have been shown to achieve near human-level performance on some languages, their effectiveness strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of language pairs.
In these lectures, I will first give a brief overview of how Deep Learning is used for text applications, and in particular for MT. Then, I will discuss our recent work on learning to translate with access to only large monolingual corpora in each language, but no example of sentences with their corresponding translations.
We propose two model variants, a neural and a phrase-based model. Although these models operate quite differently, they are both based on the same principles, namely careful initialization of parameters, the use of powerful language models and the artificial generation of parallel data.
I will show how these approaches enable translations from similar languages as well as languages that do not even share the same alphabet and linguistic structure, opening the door towards systems that can translate myriads of language pairs for which we have little if any bitexts.

##### Unsupervised Machine Translation – Part II

Machine Translation (MT) is a flagship of the recent successes and advances in the field of natural language processing. Its practical applications and use as a testbed for sequence transduction algorithms have spurred renewed interest in this topic. While MT systems have been shown to achieve near human-level performance on some languages, their effectiveness strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of language pairs.
In these lectures, I will first give a brief overview of how Deep Learning is used for text applications, and in particular for MT. Then, I will discuss our recent work on learning to translate with access to only large monolingual corpora in each language, but no example of sentences with their corresponding translations.
We propose two model variants, a neural and a phrase-based model. Although these models operate quite differently, they are both based on the same principles, namely careful initialization of parameters, the use of powerful language models and the artificial generation of parallel data.
I will show how these approaches enable translations from similar languages as well as languages that do not even share the same alphabet and linguistic structure, opening the door towards systems that can translate myriads of language pairs for which we have little if any bitexts.

##### Challenges in Neural Machine Translation

Machine Translation (MT) is a flagship of the recent successes and advances in the field of natural language processing. Its practical applications and use as a testbed for sequence transduction algorithms have spurred renewed interest in this topic. Despite great successes of neural models for MT, there is still a general lack of understanding of how these models work and fit the data distribution. In particular, MT is inherently a one-to-many learning task, as there are several plausible translations of the same source sentence. Uncertainty in the prediction task is due to both the existence of multiple valid translations for a single source sentence, and the extrinsic uncertainty caused by noise in the training data.
In the first part of this talk, I will present tools and metrics to assess how uncertainty in the data is captured by the model distribution and how it affects search strategies that generate translations.
In the second part of the talk, I will instead focus on another challenge, which is the discrepancy between how neural MT models are trained, and how they are used at test time.
At training time, these models are only asked to predict the next word in the sentence, while at test time they are asked to predict the whole sentence (translation), which may lead to accumulation of errors. I will conclude the lecture with a survey of methods to train neural MT systems at the sequence level, showing that classical structured prediction losses are quite effective, although with diminishing returns as baseline systems get stronger.

##### Choice Models in Data Analysis and their Applications – Part I

Several models based on choice procedures and the superposition principle are developed and applied in different methods of smart data analysis. Several applications are discussed, in particular to retail data, banking data, data provided by Microsoft, and tornado prediction. To evaluate the efficiency of tornado prediction, the constructed model has been tested on real-life data obtained from the University of Oklahoma (USA). It is shown that the constructed tornado prediction model is more efficient than all previous models.

##### Choice Models in Data Analysis and their Applications – Part II

Several models based on choice procedures and the superposition principle are developed and applied in different methods of smart data analysis. Several applications are discussed, in particular to retail data, banking data, data provided by Microsoft, and tornado prediction. To evaluate the efficiency of tornado prediction, the constructed model has been tested on real-life data obtained from the University of Oklahoma (USA). It is shown that the constructed tornado prediction model is more efficient than all previous models.