Welcome back! Hope you had a great week. We have a new leader on the SuperGLUE benchmark with a new Ernie model from Baidu comprising 10 billion parameters trained on a 4TB corpus. FYI, the human baseline was already beaten by Microsoft’s DeBERTa model at the beginning of the year… time for a new SuperSuperGLUE benchmark???
SuperGLUE is a new benchmark styled after the original GLUE benchmark with a set of more difficult language understanding…
The Codex Paper
BTW, if you are still interested in GitHub’s Copilot, I stumbled upon the Codex paper this week:
DeepMind’s Perceiver
DeepMind’s Perceiver transformer takes a variety of modalities (vision, audio, text) as input and achieves competitive benchmark performance. Usually a model architecture is specialized to a specific domain; the Perceiver attempts to generalize to any domain using a single architecture. 😎
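The Perceiver’s core trick is a small, fixed-size latent array that cross-attends to a large, modality-agnostic input array, so compute scales linearly with input length. A minimal NumPy sketch (toy dimensions, not the actual DeepMind implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def latent_cross_attention(inputs, latents):
    """inputs: (seq_len, dim) — flattened pixels, audio frames, or token
    embeddings; latents: (num_latents, dim) — a small learned array."""
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)  # (num_latents, seq_len): linear in input size
    return softmax(scores) @ inputs           # (num_latents, dim): fixed-size summary

rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 64))   # e.g. 1000 "pixels"
latents = rng.normal(size=(32, 64))
out = latent_cross_attention(inputs, latents)
print(out.shape)  # (32, 64) — same latent size whatever the modality or input length
```

Because the expensive attention is between 32 latents and the inputs rather than all-pairs over the inputs, the same block can ingest images, audio, or text without a modality-specific encoder.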
The Long-Short Transformer
Adding to the list of efficient transformers comes the LS-Transformer, which can be used for both autoregressive and bidirectional models, and for both language and vision domains. The model obtains SOTA results on the Long Range Arena, character-level language modeling, and ImageNet classification.
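The “short” half of long-short attention restricts each token to a sliding local window, cutting attention cost from O(n²) toward O(n·window); the “long” half (a learned low-rank projection of the full sequence) is omitted in this toy sketch:

```python
import numpy as np

def sliding_window_mask(n, window):
    """Boolean mask where token i may only attend to tokens j
    within window // 2 positions of it."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

mask = sliding_window_mask(6, window=2)
print(int(mask.sum()), 6 * 6)  # 16 36 — far fewer attended positions than full attention
```

In a real model this mask zeroes out attention logits outside the window, and the savings grow with sequence length.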
Deep Learning Videos
170 video lectures from Sebastian Raschka in 2021 using PyTorch.
Table of Contents
- Part 1: Introduction
- L01: Introduction to deep learning
- L02: The brief history of deep learning
- L03: Single-layer neural networks: The perceptron algorithm
- Part 2: Mathematical and computational foundations
- L04: Linear algebra and calculus for deep learning
- L05: Parameter optimization with gradient descent
- L06: Automatic differentiation with PyTorch
- L07: Cluster and cloud computing resources
- Part 3: Introduction to neural networks
- L08: Multinomial logistic regression / Softmax regression
- L09: Multilayer perceptrons and backpropagation
- L10: Regularization to avoid overfitting
- L11: Input normalization and weight initialization
- L12: Learning rates and advanced optimization algorithms
- Part 4: Deep learning for computer vision and language modeling
- L13: Introduction to convolutional neural networks
- L14: Convolutional neural networks architectures
- L15: Introduction to recurrent neural networks
- Part 5: Deep generative models
- L16: Autoencoders
- L17: Variational autoencoders
- L18: Introduction to generative adversarial networks
- L19: Self-attention and transformer networks
Introduction to Deep Learning
I just sat down this morning and organized all deep learning related videos I recorded in 2021. I am sure this will be…
Python Deep Learning Notebooks
Jupyter notebooks implementing the code samples found in the book Deep Learning with Python, 2nd Edition.
This repository contains Jupyter notebooks implementing the code samples found in the book Deep Learning with Python…
Hugging Face’s Model Parallelism Intro
A conceptual intro to model parallelism touching on the techniques highlighted below. HF also notes which of the techniques are currently implemented in their library.
- DataParallel (DP) — the same setup is replicated multiple times, with each replica fed a slice of the data. The processing is done in parallel, and all setups are synchronized at the end of each training step.
- TensorParallel (TP) — each tensor is split into multiple chunks, so instead of the whole tensor residing on a single GPU, each shard resides on its designated GPU. Each shard is processed separately and in parallel on different GPUs, and the results are synced at the end of the step. This is what one may call horizontal parallelism, as the splitting happens at the horizontal level.
- PipelineParallel (PP) — the model is split up vertically (layer-level) across multiple GPUs, so that only one or several layers of the model are placed on a single GPU. Each GPU processes a different stage of the pipeline in parallel, working on a small chunk of the batch.
- Zero Redundancy Optimizer (ZeRO) — also performs sharding of the tensors, somewhat similar to TP, except the whole tensor gets reconstructed in time for a forward or backward computation, so the model doesn’t need to be modified. It also supports various offloading techniques to compensate for limited GPU memory.
- Sharded DDP — is another name for the foundational ZeRO concept as used by various other implementations of ZeRO.
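To make the DP bullet concrete, here’s a minimal NumPy sketch (a toy linear model, not any library’s actual API) of one data-parallel step: each “GPU” computes a gradient on its batch shard, and the averaged gradients match the single-device full-batch gradient.

```python
import numpy as np

# Toy data-parallel step for a linear model y ≈ x @ w.
def local_grad(w, x, y):
    """MSE gradient dL/dw on this shard of the batch."""
    pred = x @ w
    return 2 * x.T @ (pred - y) / len(x)

rng = np.random.default_rng(0)
w = rng.normal(size=(4,))
x, y = rng.normal(size=(8, 4)), rng.normal(size=(8,))

# Each "GPU" gets a slice of the batch and computes its local gradient...
shards = zip(np.array_split(x, 2), np.array_split(y, 2))
grads = [local_grad(w, xs, ys) for xs, ys in shards]  # done in parallel in a real setup

# ...then the gradients are averaged (the all-reduce sync at the end of the step).
avg = np.mean(grads, axis=0)
print(np.allclose(avg, local_grad(w, x, y)))  # True — same update as full-batch training
```

TP, PP, and ZeRO differ in *what* gets sharded (tensors, layers, optimizer state) but share this same compute-then-synchronize rhythm.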
State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0. Transformers provides thousands of…
Faster Inference in Haystack’s QA System
Reducing the ‘top_k_retriever’ parameter is the trick here. This parameter represents the number of documents the reader model evaluates.
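The idea can be shown with a toy retriever–reader pipeline (illustrative only, not the real Haystack API, and the word-overlap scorer is a made-up stand-in): the reader’s cost grows with the number of candidates it must evaluate, so a smaller top_k means faster answers.

```python
def score(query, doc):
    # crude word-overlap relevance score (stand-in for BM25 / dense retrieval)
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def retrieve(query, docs, top_k):
    # cheap retrieval step: keep only the top_k candidate documents
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:top_k]

def read(query, candidates):
    # stand-in for the expensive reader model; its latency scales with len(candidates)
    return max(candidates, key=lambda d: score(query, d))

docs = ["the eiffel tower is in paris",
        "berlin is the capital of germany",
        "paris is the capital of france"]
query = "what is the capital of france"
answer = read(query, retrieve(query, docs, top_k=2))
print(answer)  # "paris is the capital of france"
```

Lowering top_k trades a little recall (the answer might not be among the retrieved docs) for a proportional cut in reader latency.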
Parameter-Tweaking: Get Faster Answers from Your Haystack Pipeline
This article is the first in our series on optimizing your Haystack question answering system. We’ll link to the other…
Common Errors in Training Data
Blog post reviewing three situations where your data goes wrong:
- Labeling Errors
- Unbalanced Training Data
- Bias in Labeling Process
Types of Errors We See with Training Data: How to Recognize and Avoid Common Data Error
It’s helpful to contrast AI development with traditional software development. In traditional software, you write code…
Introducing spaCy v3.1 · Explosion
It's been great to see the adoption of spaCy v3, which introduced transformer-based pipelines, a new training system…
Release v2.1.0 · Adapter-Hub/adapter-transformers
Based on transformers v4.8.2 Add support for loading adapters from HuggingFace Model Hub (@calpt via #162) Add method…
Repo Cypher 👨💻
A collection of recently released repos that caught our 👁
A new way to generalize and analyze data representations of the graph structure of a dataset while keeping the prediction capabilities of an attention-based encoder-decoder model.
This repository is the implementation of the Power Law Graph Transformer (PLGT) detailed in the research article: Power…
Transformer inference scales quadratically with the input sequence length, which makes it difficult to use transformers for processing long sequences. Learned Token Pruning (LTP) is a method that reduces redundant tokens as the data passes through the different layers of the transformer.
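A toy NumPy sketch of threshold-based token pruning (illustrative, not the authors’ code; the importance measure and threshold are assumptions): a token’s importance is taken as the average attention it receives, and low-importance tokens are dropped so later layers see a shorter sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prune_tokens(tokens, attn, threshold):
    """tokens: (seq_len, dim); attn: (seq_len, seq_len) attention probabilities.
    Drops tokens whose mean received attention falls below the threshold."""
    importance = attn.mean(axis=0)   # how much each token is attended to on average
    keep = importance >= threshold
    return tokens[keep], keep

rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 16))
scores = rng.normal(size=(10, 10))
scores[:, 3] -= 5.0                  # make token 3 clearly unimportant
attn = softmax(scores)

pruned, keep = prune_tokens(tokens, attn, threshold=0.02)
print(bool(keep[3]))  # False — the redundant token was pruned
```

Since attention cost is quadratic in sequence length, shrinking the sequence a little at each layer compounds into a large speedup by the top of the stack.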
Check our paper for more details. We follow the same installation procedure as the original Huggingface transformer…
Using transformers for the conversational task of dialog act recognition.
A library for working with dialog acts. The preferred way to use daseg is with an anaconda environment. We tested it…
An application supporting customizable training of diachronic word embeddings with the TWEC model.
DRIFT is a tool for Diachronic Analysis of Scientific Literature. The application offers user-friendly and customizable…
Keep It Simple (KiS)
An approach to unsupervised text simplification which learns to balance a reward across three properties: fluency, salience and simplicity.
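One simple way to balance reward terms like these (a hedged sketch with made-up scores, not the paper’s exact formulation) is a product, so a candidate that fails badly on any single property scores poorly overall:

```python
def total_reward(fluency, salience, simplicity):
    """Combine three [0, 1] reward components multiplicatively:
    a near-zero score on any one property sinks the total."""
    assert all(0.0 <= r <= 1.0 for r in (fluency, salience, simplicity))
    return fluency * salience * simplicity

# Hypothetical candidate simplifications and their (fluency, salience, simplicity) scores:
candidates = {
    "fluent but unfaithful": (0.9, 0.1, 0.8),   # drops key information
    "faithful and simple":   (0.8, 0.8, 0.7),   # balanced on all three
    "copy of the input":     (0.9, 0.9, 0.1),   # not simplified at all
}
best = max(candidates, key=lambda name: total_reward(*candidates[name]))
print(best)  # "faithful and simple"
```

A weighted sum would let a model game one easy property; the multiplicative form forces the generator to keep all three in balance, which is the point of the KiS objective.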
This repository contains the code for ACL2021 paper: Keep It Simple: Unsupervised Simplification of Multi-Paragraph…
Neural Rap Generation with Rhyme and Rhythm Modeling. 😁
Measurement of BEAT in Generated Samples
We randomly generated more than 5,000 samples using DeepRapper and DeepRapper with beat frequency control. We compared…
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat