## NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

# The NLP Cypher | 07.11.21

## Plata o Plomo

Welcome back! Hope you had a great week. We have a new leader on the SuperGLUE benchmark: a new Ernie model from Baidu with 10 billion parameters, trained on a 4TB corpus. FYI, the human baseline was already beaten by Microsoft’s DeBERTa model at the beginning of the year… time for a new SuperSuperGLUE benchmark???

# The Codex Paper

BTW, if you are still interested in GitHub’s Copilot, I stumbled upon the Codex paper this week:

# DeepMind’s Perceiver

DeepMind’s Perceiver is a transformer that can take a variety of modalities (vision, audio, text) as input and still achieve competitive benchmark results. Usually a model architecture is specialized for a specific domain; the Perceiver instead attempts to generalize to any domain using a single architecture. 😎

# The Long-Short Transformer

Adding to the list of efficient transformers comes the LS-Transformer, which can be used for both autoregressive and bi-directional models and for both language and vision domains. The model obtains SOTA results on the Long Range Arena, char-level language modeling and ImageNet classification.

**Paper:**

# Deep Learning Videos

170 video lectures from Sebastian Raschka in 2021 using PyTorch.

**Table of Contents**

- Part 1: Introduction
    - L01: Introduction to deep learning
    - L02: The brief history of deep learning
    - L03: Single-layer neural networks: The perceptron algorithm
- Part 2: Mathematical and computational foundations
    - L04: Linear algebra and calculus for deep learning
    - L05: Parameter optimization with gradient descent
    - L06: Automatic differentiation with PyTorch
    - L07: Cluster and cloud computing resources
- Part 3: Introduction to neural networks
    - L08: Multinomial logistic regression / Softmax regression
    - L09: Multilayer perceptrons and backpropagation
    - L10: Regularization to avoid overfitting
    - L11: Input normalization and weight initialization
    - L12: Learning rates and advanced optimization algorithms
- Part 4: Deep learning for computer vision and language modeling
    - L13: Introduction to convolutional neural networks
    - L14: Convolutional neural networks architectures
    - L15: Introduction to recurrent neural networks
- Part 5: Deep generative models
    - L16: Autoencoders
    - L17: Variational autoencoders
    - L18: Introduction to generative adversarial networks
    - L19: Self-attention and transformer networks

# Python Deep Learning Notebooks

Jupyter notebooks implementing the code samples found in the book Deep Learning with Python, 2nd Edition.

# Hugging Face’s Model Parallelism Intro

A conceptual intro to model parallelism touching on several techniques highlighted below. HF also highlights which of the techniques are currently implemented in their library.

- DataParallel (DP): the same setup is replicated multiple times, with each replica fed a slice of the data. The processing is done in parallel and all setups are synchronized at the end of each training step.
- TensorParallel (TP): each tensor is split into multiple chunks, so instead of the whole tensor residing on a single GPU, each shard of the tensor resides on its designated GPU. During processing, each shard is processed separately and in parallel on different GPUs, and the results are synced at the end of the step. This is what one may call horizontal parallelism, as the splitting happens on a horizontal level.
- PipelineParallel (PP): the model is split vertically (layer-level) across multiple GPUs, so that only one or several layers of the model are placed on a single GPU. Each GPU processes a different stage of the pipeline in parallel, working on a small chunk of the batch.
- Zero Redundancy Optimizer (ZeRO): also shards the tensors, somewhat similarly to TP, except that the whole tensor is reconstructed in time for a forward or backward computation, so the model doesn’t need to be modified. It also supports various offloading techniques to compensate for limited GPU memory.
- Sharded DDP: another name for the foundational ZeRO concept, as used by various other implementations of ZeRO.
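
To make the DP vs. TP distinction a bit more concrete, here is a toy sketch in plain Python lists. It is not how HF or any framework implements parallelism: the "devices" are just loop iterations, only a single forward matrix multiply is shown (no gradients, no real synchronization), and all function names are made up for illustration.

```python
# Toy illustration only: "devices" are loop iterations, not real GPUs.

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(matrix, parts):
    """Split a matrix into `parts` column shards of equal width."""
    width = len(matrix[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in matrix] for i in range(parts)]

def tensor_parallel_matmul(x, w, parts):
    """TP flavour: each hypothetical device holds one column shard of w."""
    shard_outputs = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Sync step: concatenate the partial outputs column-wise.
    return [sum((out[r] for out in shard_outputs), []) for r in range(len(x))]

def data_parallel_matmul(x, w, parts):
    """DP flavour: each hypothetical device holds all of w and a slice of the batch."""
    size = len(x) // parts
    slices = [x[i * size:(i + 1) * size] for i in range(parts)]
    outputs = [matmul(batch_slice, w) for batch_slice in slices]
    # Sync step: gather the per-slice outputs back into one batch.
    return [row for out in outputs for row in out]

x = [[1, 2], [3, 4], [5, 6], [7, 8]]  # batch of 4 inputs, 2 features each
w = [[1, 0, 2, 1], [0, 1, 1, 3]]      # 2 x 4 weight matrix

full = matmul(x, w)
assert tensor_parallel_matmul(x, w, parts=2) == full
assert data_parallel_matmul(x, w, parts=2) == full
```

Both variants reproduce the single-device result; what changes is what gets split (the weights for TP, the batch for DP) and therefore what each device has to keep in memory.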

**Source**

# Faster Inference in Haystack’s QA System

Reducing the ‘top_k_retriever’ parameter is the trick here. This parameter represents the number of documents the reader model evaluates.
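
Why this helps can be shown with a toy retriever-reader loop. This is a sketch, not Haystack's API: `retrieve`, `answer`, and the word-overlap scoring are hypothetical stand-ins, and the "reader" is just a counter marking where an expensive model call would happen.

```python
def overlap(query, doc):
    """Number of distinct words shared by the query and the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query, documents, top_k):
    """Hypothetical retriever: rank documents by word overlap, keep the top_k."""
    ranked = sorted(documents, key=lambda doc: overlap(query, doc), reverse=True)
    return ranked[:top_k]

def answer(query, documents, top_k):
    """Run the (expensive) reader once per retrieved document and count the calls."""
    reader_calls = 0
    best = None
    for doc in retrieve(query, documents, top_k):
        reader_calls += 1  # stands in for one forward pass of a reader model
        score = overlap(query, doc)
        if best is None or score > best[0]:
            best = (score, doc)
    return (best[1] if best else None), reader_calls

docs = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "The Nile is a river in Africa",
    "France borders Spain",
]
query = "What is the capital of France"

print(answer(query, docs, top_k=4))  # reader runs on all four documents
print(answer(query, docs, top_k=1))  # reader runs on a single document
```

Reader work grows linearly with the number of documents passed along, so a smaller value cuts latency. The trade-off is recall: if the retriever ranks the right document below the cutoff, the reader never sees it.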

# Common Errors in Training Data

Blog post reviewing three situations where your data goes wrong:

- Labeling Errors
- Unbalanced Training Data
- Bias in Labeling Process
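
Two of these are easy to sanity-check before training. The snippet below is a rough sketch, not taken from the blog post: it assumes labels come as a plain list and that each item was labeled by several annotators, and it uses annotator disagreement only as a cheap proxy for possible labeling errors. The function names and data layout are hypothetical.

```python
from collections import Counter

def class_balance(labels):
    """Return each label's share of the dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def disputed_items(annotations):
    """Return ids of items whose annotators did not all agree.

    `annotations` maps a (hypothetical) item id to the list of labels it received.
    """
    return sorted(item for item, votes in annotations.items() if len(set(votes)) > 1)

labels = ["pos", "pos", "pos", "neg"]
print(class_balance(labels))

annotations = {"a": ["pos", "pos"], "b": ["pos", "neg"], "c": ["neg", "neg"]}
print(disputed_items(annotations))
```

Bias in the labeling process is harder to catch with a one-liner; it usually needs a look at who labeled what and under which guidelines.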

# Software Updates

## spaCy 3.1:

## Adapters 2.1.0:

# Repo Cypher 👨💻

## A collection of recently released repos that caught our 👁

## Power Law Graph Transformer

A new way to generalize and analyze data representations of the graph structure of a dataset while keeping the prediction capabilities of an attention-based encoder-decoder model.

## Learned Token Pruning

Transformer inference cost scales quadratically with the input sequence length, which makes it difficult to use transformers for processing long sequences. Learned Token Pruning (LTP) is a method that removes redundant tokens as the data passes through the different layers of the transformer.
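
The general idea of dropping low-importance tokens between layers can be sketched in a few lines. This is a simplified illustration, not the repo's code: it assumes a token's importance is the average attention it receives across heads and query positions, and it uses a fixed, hand-picked threshold where the actual method learns its thresholds during training.

```python
def token_importance(attention):
    """Average attention each token receives.

    `attention[h][i][j]` is the attention probability from query token i
    to key token j in head h.
    """
    heads = len(attention)
    n = len(attention[0])
    return [
        sum(attention[h][i][j] for h in range(heads) for i in range(n)) / (heads * n)
        for j in range(n)
    ]

def prune_tokens(tokens, attention, threshold):
    """Keep only the tokens whose importance reaches the threshold."""
    scores = token_importance(attention)
    return [tok for tok, score in zip(tokens, scores) if score >= threshold]

tokens = ["[CLS]", "movie", "um"]
attention = [[  # one head, three tokens; each row sums to 1
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
]]

kept = prune_tokens(tokens, attention, threshold=0.2)  # 0.2 is an arbitrary value
print(kept)
print(len(tokens) ** 2, "->", len(kept) ** 2)  # attention matrix entries before and after
```

Because attention cost grows with the square of the sequence length, every token dropped at an early layer saves work in all the layers after it.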

## Daseg

Using transformers for the conversational task of dialog act recognition.

## DRIFT Library

An application supporting customizable training of diachronic word embeddings with the TWEC model.

## Keep It Simple (KiS)

An approach to unsupervised text simplification which learns to balance a reward across three properties: fluency, salience and simplicity.

## DeepRapper

Neural Rap Generation with Rhyme and Rhythm Modeling.

😁

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat