#### Biography

*Positions*: a) Head, Department of Mathematics for Economics, National Research University Higher School of Economics; b) Head, International Laboratory of Decision Choice and Analysis, National Research University Higher School of Economics; c) Head, Laboratory of Choice Theory and Decision Analysis, Russian Academy of Sciences Institute of Control Sciences

*Education*: 1969-1974, Student, Mathematics Faculty, Moscow State University; 1981, Ph.D. in Control in Socio-Economic Systems (thesis title “Interval Choice”)

*Distinctions and Awards:* 1993, Doctor of Science (thesis title “Local Aggregation Models”); Honorary Worker of Science and Technology of Russian Federation, 2011; Medal of the Order “For Merit II” (Decree of the President of Russia on 21.12.2013)

*Publications:* 10 books, more than 200 articles, more than 100 in peer-reviewed journals and volumes; copyright certificates and patents: 5

*Other Professional Activities:* Member of International Economic Association (member of the Executive Council, 2011-2017); American Mathematical Society; New Economic Association, Russia. Member of Editorial Board for the journals: Mathematical Social Sciences, Automation and Remote Control, Political Studies (TBF), Control Problems (in Russian), Politeia (in Russian), Economic Journal HSE (in Russian), Business-informatics (in Russian), Journal of New Economic Association (in Russian), vice-editor-in-chief, Mathematical Game Theory and its Applications, Control in Large-Scale Systems (on-line journal, in Russian), International Journal of Information Technologies and Decision Making, Annals of Data Analysis, Group Decisions and Negotiation

*Invited Speaker:* more than 90 conferences and workshops

*Other activities:* 2008 – present, Head of the Board of the Congregation ‘Le Dor va Dor’ of the World Union of Progressive Judaism, Moscow, Russia

#### Lectures

Several models based on choice procedures and the superposition principle are developed and applied to different methods of smart data analysis. Several applications are discussed, in particular to retail data, banking data, data provided by Microsoft, and tornado prediction. To evaluate its efficiency, the constructed tornado prediction model has been tested on real-life data obtained from the University of Oklahoma (USA). It is shown that the constructed tornado prediction model is more efficient than all previous models.

#### Biography

Roman Belavkin is a Reader in Informatics at the Department of Computer Science, Middlesex University, UK. He has an MSc degree in Physics from Moscow State University and a PhD in Computer Science from the University of Nottingham, UK. In his PhD thesis, Roman combined cognitive science and information theory to study the role of emotion in decision-making, learning and problem solving. His main research interests are in the mathematical theory of dynamics of information and optimization of learning, adaptive and evolving systems. He used information value theory to give novel explanations of some common decision-making paradoxes. His work on optimal transition kernels showed non-existence of optimal deterministic strategies in a broad class of problems with information constraints.

Roman’s theoretical work on optimal parameter control in algorithms has found applications to computer science and biology. From 2009, Roman led a collaboration between four UK universities involving mathematics, computer science and experimental biology on optimal mutation rate control, which led to the discovery in 2014 of mutation rate control in bacteria (reported in Nature Communications http://doi.org/skb and PLOS Biology http://doi.org/cb9s). He also contributed to research projects on neural cell-assemblies, independent component analysis and anomaly detection (for example, of cyber attacks).

#### Lectures

Many large networks occurring naturally (e.g. the phone calls graph, the internet domains and routers, the World Wide Web, metabolic and protein networks) are characterized by the power-law degree sequence $N(k)\propto k^{-\beta}$, such that the number $N(k)$ of vertices (nodes) with degree $k$ is inversely proportional to a power of that degree, with exponent parameter $\beta>0$ (i.e. nodes with small degree occur more frequently). It is also known that such graphs can be generated by the preferential attachment procedure, when new nodes are more likely to be connected to nodes with high degree. I will show how the power-law degree sequence can be obtained as a solution to the maximum entropy problem with a constraint on the expectation of the logarithm of the degree. Then I will show that the preferential attachment procedure can be obtained as a solution to the dual problem of minimizing Shannon’s mutual information between nodes subject to a constraint on the expected path length in the graph. This will allow us to introduce two important variational problems of information theory and show duality between them. In addition, this information-theoretic approach will allow us to give a new interpretation of some results about power-law graphs. In particular, we shall discuss conditions for connectedness and existence of a giant component in large power-law graphs and the corresponding phase transition. We shall also derive a new formula for estimating the exponent parameter $\beta$, which is the Lagrange multiplier related to the constraint in the maximum entropy problem.
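The maximum entropy claim above can be illustrated with a short textbook-style derivation. This is only a sketch under assumptions made here for illustration: the degree distribution is written as $p(k)\propto N(k)$ over degrees $k\ge 1$, and the symbols $\alpha$ (multiplier for normalization) and $\lambda$ (the prescribed value of the expected log-degree) do not come from the lecture.

$$
\begin{aligned}
&\text{maximize}\quad H(p) = -\sum_{k\ge 1} p(k)\ln p(k)
\quad\text{subject to}\quad \sum_{k\ge 1} p(k) = 1,\qquad \sum_{k\ge 1} p(k)\ln k = \lambda,\\
&L(p) = -\sum_{k\ge 1} p(k)\ln p(k) - \alpha\Bigl(\sum_{k\ge 1} p(k) - 1\Bigr) - \beta\Bigl(\sum_{k\ge 1} p(k)\ln k - \lambda\Bigr),\\
&\frac{\partial L}{\partial p(k)} = -\ln p(k) - 1 - \alpha - \beta\ln k = 0
\;\Longrightarrow\;
p(k) = e^{-1-\alpha}\,k^{-\beta} = \frac{k^{-\beta}}{\sum_{j\ge 1} j^{-\beta}} .
\end{aligned}
$$

In this sketch $\beta$ appears as the Lagrange multiplier of the log-degree constraint, in line with the abstract. With an unbounded range of degrees the normalizing sum converges only for $\beta>1$; if the degrees are bounded by a maximum degree, the sum is finite for any $\beta>0$.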

The mathematical theory of optimal decisions under uncertainty is based on the idea of maximizing the expected utility functional over a set of lotteries. This view appears natural to mathematicians, who consider these lotteries as probability measures on some common algebra of events; the linear structure of the space of measures leads to the result of von Neumann and Morgenstern, celebrated in game theory, on the existence of a linear or affine objective functional, the expected utility. Behavioural economists and psychologists, on the other hand, have demonstrated that people consistently violate the axioms of expected utility, and this includes professional risk-takers such as stockbrokers. In this talk I will show how these paradoxes can be explained if, apart from utility, one also considers the value of information that a decision-maker receives. This will be a good opportunity to introduce the value of information theory, which was developed by Rouslan Stratonovich in the 1960s as an amalgamation of theories of optimal decision-making and information. I will also outline a new geometric approach to the value of information, which will allow us to see that some properties of the optimal value function are independent of a specific definition of information. We shall extend this approach to a dynamical system, in which decisions with information constraints are made sequentially, and define a frontier corresponding to an optimally learning system. I will show how this theory gives new insights into parameter control of learning and optimization algorithms.

#### Biography

Yoshua Bengio is Full Professor of the Department of Computer Science and Operations Research, head of the Montreal Institute for Learning Algorithms (MILA), co-director of the CIFAR program on Learning in Machines and Brains, and Canada Research Chair in Statistical Learning Algorithms. His main research ambition is to understand principles of learning that yield intelligence. He supervises a large group of graduate students and post-docs. His research is widely cited (over 80000 citations found by Google Scholar in September 2017, with an H-index of 101).

Yoshua Bengio is currently action editor for the Journal of Machine Learning Research, associate editor for the Neural Computation journal, editor for Foundations and Trends in Machine Learning, and has been associate editor for the Machine Learning Journal and the IEEE Transactions on Neural Networks.

Yoshua Bengio was Program Chair for NIPS 2008 and General Chair for NIPS 2009 (NIPS is the flagship conference in the areas of learning algorithms and neural computation). Since 1999, he has been co-organizing the Learning Workshop with Yann LeCun, with whom he has also created the International Conference on Learning Representations (ICLR). He has also organized or co-organized numerous other events, principally the deep learning workshops and symposia at NIPS and ICML since 2007. Yoshua Bengio is Officer of the Order of Canada and member of the Royal Society of Canada.

#### Lectures

There has been much progress in AI thanks to advances in deep learning in recent years, especially in areas such as computer vision, speech recognition, natural language processing, playing games, robotics, machine translation, etc. This lecture aims at explaining some of the core concepts and motivations behind deep learning and representation learning. Deep learning builds on many of the ideas introduced decades earlier with the connectionist approach to machine learning, inspired by the brain. These essential early contributions include the notion of distributed representation and the back-propagation algorithm for training multi-layer neural networks, but also the architecture of recurrent neural networks and convolutional neural networks. In addition to the substantial increase in computing power and dataset sizes, many modern additions have contributed to the recent successes.

These include techniques making it possible to train networks with more layers – which can generalize better (hence the name deep learning) – as well as a better theoretical understanding for the success of deep learning, both from an optimization point of view and from a generalization point of view. Two other areas of major progress have been in unsupervised learning, in particular the ability of neural networks to stochastically generate high-dimensional samples (like images) from a possibly conditional distribution, as well as the combination of reinforcement learning and deep learning techniques.

Much of the progress in deep learning has been in supervised learning and perception, making it possible to provide a form of intuitive understanding to computers, and more generally to achieve good performance in System 1 cognitive tasks (unconscious, fast, intuitive). Much remains to be done to handle System 2 processing (conscious, slow, sequential, linguistic, explicit). The tools we already have are based on recurrent neural networks, which will be described with their own issues, and attention mechanisms. Together, these advances have moved neural nets from pattern recognition devices working on vectors to general-purpose differentiable modular machines which can handle arbitrary data structures. The lecture will close with a discussion of open questions to handle more of System 2 capabilities, such as the ability to reason, to anchor language in an intuitive world model, to model attentive consciousness, to capture causal explanations at multiple time scales, and more generally to exploit action to build better models of the world.

One of the central questions for deep learning is how a learning agent could discover good representations in an unsupervised way. First, we consider the still open question of what constitutes a good representation, with the notion of disentangling the underlying factors of variation. We view representation learning from a geometric perspective as a transformation of the data space which changes the shape of the data manifold in order to flatten it, thus separating the sources of variation from each other. Second, we discuss issues with the maximum likelihood framework which has been behind our early work on Boltzmann machines as well as our work on auto-regressive and recurrent neural networks as generative models. These issues motivated our initial development of Generative Adversarial Networks, a research area which has greatly expanded recently. We discuss how adversarial training can be used to obtain invariances to some factors in the representation, and a way to make training with such an adversarial objective more stable by pushing the discriminator score towards the classification boundary but not past it. Finally, we discuss applications of GAN ideas to estimate, minimize or maximize mutual information, entropy or independence between random variables.

#### Biography

Dr. Butenko’s research concentrates mainly on global and discrete optimization and their applications. In particular, he is interested in theoretical and computational aspects of continuous global optimization approaches for solving discrete optimization problems on graphs. Applications of interest include network-based data mining, analysis of biological and social networks, wireless ad hoc and sensor networks, energy, and sports analytics.

#### Lectures

Many real-life complex systems can be conveniently represented using networks, with the nodes corresponding to the system’s components and the arcs describing their pairwise interactions. Analyzing structural properties of a network model provides useful insights into the underlying system’s behavior. This talk introduces the basics of the network-based approach to analysis of large data sets and discusses several applications of this methodology.

Cluster analysis is an important task arising in network-based data analysis. Perhaps the most natural model of a cluster in a network is given by a clique, which is a subset of pairwise-adjacent nodes. However, the clique model appears to be overly restrictive in practice, which has led to the introduction of numerous models relaxing various properties of cliques, known as clique relaxations. This talk focuses on a systematic cluster analysis framework based on clique relaxation models.

We discuss continuous formulations for several cluster-detection problems in networks, including the maximum edge weight clique, the maximum s-plex, and the maximum independent union of cliques problems. More specifically, the problems of interest are formulated as quadratic, cubic, or higher-degree polynomial optimization problems subject to linear (typically, unit hypercube) constraints. The proposed formulations are used to develop analytical bounds as well as effective algorithms for some of the problems.
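To make the clique and s-plex models concrete, here is a minimal Python sketch of the two membership checks. It uses the standard definitions (a clique is a set of pairwise-adjacent nodes; an s-plex is a set in which every node is adjacent to at least |S| - s other members), not the continuous formulations of the talk. The function names and the small example graph are hypothetical.

```python
from itertools import combinations


def is_clique(adj, nodes):
    """Return True if every pair of distinct nodes in `nodes` is adjacent."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))


def is_s_plex(adj, nodes, s):
    """Return True if each node has at least len(nodes) - s neighbours inside `nodes`."""
    node_set = set(nodes)
    return all(len(adj[u] & node_set) >= len(node_set) - s for u in node_set)


# Hypothetical undirected graph: the 4-cycle 0-1-2-3-0 plus the chord 0-2,
# stored as a symmetric adjacency map from node to set of neighbours.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}

print(is_clique(adj, {0, 1, 2}))        # triangle 0-1-2
print(is_clique(adj, {0, 1, 2, 3}))     # nodes 1 and 3 are not adjacent
print(is_s_plex(adj, {0, 1, 2, 3}, 2))  # each node may miss one other member
```

With s = 1 the s-plex condition coincides with the clique condition, which is why the s-plex is regarded as a relaxation of the clique model.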

#### Biography

Marco Gori received the Ph.D. degree in 1990 from Università di Bologna, Italy, while working partly as a visiting student at the School of Computer Science, McGill University – Montréal. In 1992, he became an associate professor of Computer Science at Università di Firenze and, in November 1995, he joined the Università di Siena, where he is currently full professor of computer science. His main interests are in machine learning, computer vision, and natural language processing. He was the leader of the WebCrow project supported by Google for automatic solving of crosswords, which outperformed human competitors in an official competition within the ECAI-06 conference. He has just published the book “Machine Learning: A Constraint-Based Approach,” where you can find his view on the field.

He has been an Associate Editor of a number of journals in his area of expertise, including The IEEE Transactions on Neural Networks and Neural Networks, and he has been the Chairman of the Italian Chapter of the IEEE Computational Intelligence Society and the President of the Italian Association for Artificial Intelligence. He is a fellow of the ECCAI (EurAI) (European Coordinating Committee for Artificial Intelligence), a fellow of the IEEE, and of IAPR. He is in the list of top Italian scientists kept by VIA-Academy.

#### Lectures

In this talk we introduce the principle of Least Cognitive Action with the purpose of understanding perceptual learning processes in a framework of laws of nature that closely parallels related approaches in physics. Neural networks are regarded as systems whose connections are Lagrangian variables, namely functions depending on time. They are used to minimize the cognitive action, an appropriate functional index that is composed of a potential and of a kinetic term, that is shown to fit very well the classic machine learning criteria based on regularization.

The theory is applied to the construction of an unsupervised learning scheme for visual features in deep convolutional neural networks, where an appropriate Lagrangian term is used to enforce a solution where the features are developed under motion invariance. The causal optimization of the cognitive action yields a solution where learning is carried out by an appropriate blurring of the video, along with the interleaving of segments of null signal. Interestingly, this also sheds light on the video blurring process in newborns, as well as on the benefit from eye blinking and day-night rhythm.

Learning and inference are traditionally regarded as the two opposite, yet complementary and puzzling components of intelligence. In this talk we point out that a constraint-based modeling of the environmental agent interactions makes it possible to unify learning and inference within the same mathematical framework. The unification is based on the abstract notion of constraint, which provides a representation of knowledge granules gained from the interaction with the environment. The agents are based on a deep neural network architecture, and their learning and inferential processes are driven by different schemes for enforcing the environmental constraints. Logic constraints are also included thanks to their translation into real-valued functions that arises from the adoption of suitable t-norms.
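As an illustration of how a logic constraint can be turned into a real-valued function, the sketch below uses the product t-norm, which is one common choice assumed here; the talk does not say which t-norms are adopted. The function names and the example rule are hypothetical. Truth values lie in [0, 1], and the rule "a implies b" is encoded as "not (a and not b)".

```python
def t_and(a, b):
    """Product t-norm: fuzzy conjunction of truth values in [0, 1]."""
    return a * b


def t_not(a):
    """Standard fuzzy negation."""
    return 1.0 - a


def t_implies(a, b):
    """Implication a -> b, encoded as not(a and not b)."""
    return t_not(t_and(a, t_not(b)))


def constraint_penalty(a, b):
    """Penalty for the rule a -> b: 0 when satisfied, 1 when fully violated."""
    return 1.0 - t_implies(a, b)


# Hypothetical rule "is_bird(x) -> can_fly(x)" evaluated on predicted truth degrees.
print(constraint_penalty(1.0, 1.0))  # rule satisfied
print(constraint_penalty(1.0, 0.0))  # rule fully violated
print(constraint_penalty(0.5, 0.5))  # partial violation
```

Because the penalty is a smooth function of the truth degrees, a term of this kind could in principle be added to a training objective alongside the usual supervised loss.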

The basic ideas are presented through simple case studies ranging from learning and inference in social nets and missing data to checking of logic constraints and pattern generation. The theory offers a natural bridge between the formalization of knowledge and the inductive acquisition of concepts from data.

#### Biography

Yike Guo is a Professor of Computing Science in the Department of Computing at Imperial College London. He is the founding Director of the Data Science Institute at Imperial College, as well as leading the Discovery Science Group in the department. Professor Guo also holds the position of CTO of the tranSMART Foundation, a global open source community using and developing data sharing and analytics technology for translational medicine.

Professor Guo received a first-class honours degree in Computing Science from Tsinghua University, China, in 1985 and received his PhD in Computational Logic from Imperial College in 1993 under the supervision of Professor John Darlington. He founded InforSense, a software company for life science and health care data analysis, and served as CEO for several years before the company’s merger with IDBS, a global advanced R&D software provider, in 2009.

He has been working on technology and platforms for scientific data analysis since the mid-1990s, where his research focuses on knowledge discovery, data mining and large-scale data management. He has contributed to numerous major research projects including: the UK EPSRC platform project, Discovery Net; the Wellcome Trust-funded Biological Atlas of Insulin Resistance (BAIR); and the European Commission U-BIOPRED project. He is currently the Principal Investigator of the European Innovative Medicines Initiative (IMI) eTRIKS project, a €23M project that is building a cloud-based informatics platform, in which tranSMART is a core component for clinico-genomic medical research, and co-Investigator of Digital City Exchange, a £5.9M research programme exploring ways to digitally link utilities and services within smart cities.

Professor Guo has published over 200 articles, papers and reports. Projects he has contributed to have been internationally recognised, including winning the “Most Innovative Data Intensive Application Award” at the Supercomputing 2002 conference for Discovery Net, and the Bio-IT World “Best Practices Award” for U-BIOPRED in 2014. He is a Senior Member of the IEEE and is a Fellow of the British Computer Society.

#### Lectures

#### Biography

Peter Norvig is a Director of Research at Google Inc. Previously he was head of Google’s core search algorithms group, and of NASA Ames’s Computational Sciences Division, making him NASA’s senior computer scientist. He received the NASA Exceptional Achievement Award in 2001. He has taught at the University of Southern California and the University of California at Berkeley, from which he received a Ph.D. in 1986 and the distinguished alumni award in 2006. He was co-teacher of an Artificial Intelligence class that signed up 160,000 students, helping to kick off the current round of massive open online classes. His publications include the books Artificial Intelligence: A Modern Approach (the leading textbook in the field), Paradigms of AI Programming: Case Studies in Common Lisp, Verbmobil: A Translation System for Face-to-Face Dialog, and Intelligent Help Systems for UNIX. He is also the author of the Gettysburg Powerpoint Presentation and the world’s longest palindromic sentence. He is a fellow of the AAAI, ACM, California Academy of Science and American Academy of Arts & Sciences.

#### Lectures

We now know how to train a program to play world-champion-level chess or Go, just by simulating moves into the future and reinforcing the best ones. But we can’t train a human student to learn in the same way. This talk examines what we do know about applying data to education, and what might be coming in the future.

The software industry has built up a formidable set of tools for software development over the last half century. But we are just starting to understand what tools are needed to build software with machine learning, not hand-coding. Some of those tools will themselves make use of machine learning.

#### Biography

Panos M. Pardalos serves as distinguished professor of industrial and systems engineering at the University of Florida. Additionally, he is the Paul and Heidi Brown Preeminent Professor of industrial and systems engineering. He is also an affiliated faculty member of the computer and information science department, the Hellenic Studies Center, and the biomedical engineering program, and the director of the Center for Applied Optimization. Pardalos is a world-leading expert in global and combinatorial optimization. His recent research interests include network design problems, optimization in telecommunications, e-commerce, data mining, biomedical applications, and massive computing.

#### Lectures

The network revolution began in the 18th century when Euler solved the famous Königsberg bridge problem. In the 19th century, Kirchhoff initiated the theory of electrical networks and was the first person who defined the flow conservation equations, one of the milestones of network flow theory. After the invention of the telephone by Alexander Graham Bell in the 19th century, the resulting applications stimulated network analysis. The field evolved dramatically after the 19th century. At present, our lives are both affected by and interface with the networks that connect us.

After a brief historical overview, the talk will focus on recent exciting developments and discuss future challenges not only in the science of networks but also in our network-driven and changing society.

Large scale problems in engineering, in the design of networks and energy systems, the biomedical fields, and finance are modeled as optimization problems. Humans and nature are constantly optimizing to minimize costs or maximize profits, to maximize the flow in a network, or to minimize the probability of a blackout in a smart grid.

Due to new algorithmic developments and the computational power of machines (digital, analog, biochemical, quantum computers, etc.), optimization algorithms have been used to “solve” problems in a wide spectrum of applications in science and engineering.

But what do we mean by “solving” an optimization problem? What are the limits of what machines (and humans) can compute?

Financial markets, banks, currency exchanges and other institutions can be modeled and analyzed as network structures where nodes are any agents such as companies, shareholders, currencies, or countries. The edges (which can be weighted, oriented, etc.) represent any type of relations between agents, for example, ownership, friendship, collaboration, influence, dependence, and correlation. We are going to discuss network and data science techniques to study the dynamics of financial markets and other problems in economics.

#### Biography

Professor Alex “Sandy” Pentland directs the MIT Connection Science and Human Dynamics labs and previously helped create and direct the MIT Media Lab and the Media Lab Asia in India. He is one of the most-cited scientists in the world, and Forbes recently declared him one of the “7 most powerful data scientists in the world” along with Google founders and the Chief Technical Officer of the United States. He has received numerous awards and prizes such as the McKinsey Award from Harvard Business Review, the 40th Anniversary of the Internet from DARPA, and the Brandeis Award for work in privacy.

He is a founding member of advisory boards for Google, AT&T, Nissan, and the UN Secretary General, a serial entrepreneur who has co-founded more than a dozen companies including social enterprises such as the Data Transparency Lab, the Harvard-ODI-MIT DataPop Alliance and the Institute for Data Driven Design. He is a member of the U.S. National Academy of Engineering and leader within the World Economic Forum.

Over the years Sandy has advised more than 60 PhD students. Almost half are now tenured faculty at leading institutions, with another one-quarter leading industry research groups and a final quarter founders of their own companies. Together Sandy and his students have pioneered computational social science, organizational engineering, wearable computing (Google Glass), image understanding, and modern biometrics. His most recent books are ‘Social Physics’, published by Penguin Press, and ‘Honest Signals’, published by MIT Press.

Interesting experiences include dining with British Royalty and the President of India, staging fashion shows in Paris, Tokyo, and New York, and developing a method for counting beavers from space.

#### Lectures

Human behavior is as much a function of social influence as personal thought. Consequently, observations of interactions (including simple observation of others) and the behaviors of others can accurately predict the future behavior of individuals. I will review the evidence for these effects, and show how to construct predictive models from observations. I will then review the use of various data sources and methods of analysis in terms of their usefulness for human behavior prediction.

The availability of data from phones, credit cards, automobiles and other modern technologies places personal privacy at risk. Regulations such as GDPR are improving the situation; however, many worry that these new restrictions will prevent improvements in health, decision making, and other civil systems. Another worry is that these data are held by different institutions, so that it is difficult to achieve a holistic view of the situation. I will explain my work that led to GDPR, and methods of handling and analyzing data that protect both privacy and proprietary data and yet provide state-of-the-art insights.

#### Biography

Marc’Aurelio Ranzato is a Research Scientist at the Facebook AI Research lab in New York City. His research interests are in the area of unsupervised learning, continual learning and transfer learning, with applications to vision, natural language understanding and speech recognition.

Marc’Aurelio earned a PhD in Computer Science at New York University under Yann LeCun’s supervision. After a post-doc with Geoffrey Hinton at the University of Toronto, he joined the Google Brain team in 2011. In 2013 he joined Facebook and was a founding member of the Facebook AI Research lab.

Marc’Aurelio has served as Senior Program Chair for ICLR in 2017, and Program Chair for ICLR in 2018. He has also served as Area Chair for several international conferences such as NIPS, ICML, CVPR and ICCV.

#### Lectures

Machine Translation (MT) is a flagship of the recent successes and advances in the field of natural language processing. Its practical applications and use as a testbed for sequence transduction algorithms have spurred renewed interest in this topic. While MT systems have been shown to achieve near human-level performance on some languages, their effectiveness strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of language pairs.

In these lectures, I will first give a brief overview of how Deep Learning is used for text applications, and in particular for MT. Then, I will discuss our recent work on learning to translate with access to only large monolingual corpora in each language, but no example of sentences with their corresponding translations.

We propose two model variants, a neural and a phrase-based model. Although these models operate quite differently, they are both based on the same principles, namely careful initialization of parameters, the use of powerful language models and the artificial generation of parallel data.

I will show how these approaches enable translations from similar languages as well as languages that do not even share the same alphabet and linguistic structure, opening the door towards systems that can translate myriads of language pairs for which we have little if any bitexts.

Machine Translation (MT) is a flagship of the recent successes and advances in the field of natural language processing. Its practical applications and use as a testbed for sequence transduction algorithms have spurred renewed interest in this topic. Despite great successes of neural models for MT, there is still a general lack of understanding of how these models work and fit the data distribution. In particular, MT is inherently a one-to-many learning task, as there are several plausible translations of the same source sentence. Uncertainty in the prediction task is due to both the existence of multiple valid translations for a single source sentence, and the extrinsic uncertainty caused by noise in the training data.

In the first part of this talk, I will present tools and metrics to assess how uncertainty in the data is captured by the model distribution and how it affects search strategies that generate translations.

In the second part of the talk, I will instead focus on another challenge, which is the discrepancy between how neural MT models are trained, and how they are used at test time. At training time, these models are only asked to predict the next word in the sentence, while at test time they are asked to predict the whole sentence (translation), which may lead to accumulation of errors. I will conclude the lecture with a survey of methods to train neural MT systems at the sequence-level, showing that classical structure prediction losses are quite effective, although with diminishing returns as baseline systems get stronger.