NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 10.03.21

Glomar

Ricky Costa
5 min read · Oct 3, 2021

Hey, welcome back! Loads of NLP research and code came in this week. But first… is your location actually hidden? 💽

How to Train Really Large Models on Many GPUs?

ToC

Training Parallelism

Mixture-of-Experts (MoE)

Other Memory Saving Designs

If you need to know more about the innards of GPUs 👇

PLATO-XL: World’s First 11 Billion Parameter Pre-Trained Dialogue Generation Model?

Blog: http://research.baidu.com/Blog/index-view?id=163

Paper: https://arxiv.org/pdf/2109.09519.pdf

For the Python GUI Inclined…

“PySimpleGUI is a Python package that enables Python programmers of all levels to create GUIs.”

Repo for Measuring Data Quality

Repo goes after:

  • Bias and Fairness: Guarantees that data is not biased and its application is fair concerning sensitive attributes for which there are legal and ethical obligations not to differentiate the treatment (e.g., gender, race).
  • Data Expectations: Unit tests for data that assert a particular property. Leverage Great Expectations validations, integrate their outcomes in our framework and check the quality of the validations.
  • Data Relations: Checks the association between features, test for causality effects, estimate the feature importance, and detect features with high collinearity.
  • Drift Analysis: Often, with time, different patterns may evolve from the data. Using this module, you can check the stability of your features (i.e., covariates) and target (i.e., label) as you look at different chunks of data.
  • Duplicates: Data may come from different sources and is not always unique. This module checks for repeated entries in data that are redundant and can (or should) be dropped.
  • Labellings: With specialized engines for categorical and numerical targets, this module provides a test suite that detects both common (e.g., imbalanced labels) and complex analysis (e.g., label outliers).
  • Missings: Missing values can cause multiple problems in data applications. With this module, you can better understand the severity of their impact and how they occur.
  • Erroneous Data: Data may contain values without inherent meaning. With this module, you can inspect your data (tabular and time-series) for typical misguided values on data points.
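To make a couple of these checks concrete, here is a minimal sketch of what a duplicates check and a missing-values check could look like in plain Python. This is not the repo's API: the helper names `find_duplicates` and `missing_ratio` and the toy rows are made up for illustration, and a real pipeline would more likely run on DataFrames.

```python
from collections import Counter


def find_duplicates(rows):
    """Return each distinct row that appears more than once.

    Rows are dicts with hashable values; two rows count as duplicates
    only when every field matches.
    """
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [dict(key) for key, n in counts.items() if n > 1]


def missing_ratio(rows, columns):
    """Return, per column, the fraction of rows where the value is None or absent."""
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


# Hypothetical toy data: one repeated row and one missing label.
rows = [
    {"id": 1, "label": "spam"},
    {"id": 2, "label": None},
    {"id": 1, "label": "spam"},
    {"id": 3, "label": "ham"},
]

duplicates = find_duplicates(rows)
missing = missing_ratio(rows, ["id", "label"])
print(duplicates)
print(missing)
```

The same idea scales up: count exact repeats before deciding what to drop, and look at the per-column missing rate before deciding how to impute.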

FastAPI Deployment Docs Update

K8s included:

Colab of the Week

Wanna play with GPT-J 6B? NovelAI fine-tuned EleutherAI’s GPT-J 6B on 4GB of Python code for code generation. In FP16, it fits in 16GB of GPU VRAM.
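Why 16GB is enough: a rough back-of-the-envelope sketch, assuming roughly 6 billion parameters (taken from the model name) and counting only the weights. Activations and other runtime buffers come on top, so treat the numbers as a lower bound. The helper `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Assumption: ~6 billion parameters, rounded from the "6B" in the model name.
num_params = 6e9

fp16_gb = weight_memory_gb(num_params, 2)  # FP16 stores 2 bytes per parameter
fp32_gb = weight_memory_gb(num_params, 4)  # FP32 stores 4 bytes per parameter

print(f"FP16 weights: {fp16_gb:.1f} GB")
print(f"FP32 weights: {fp32_gb:.1f} GB")
```

At 2 bytes per parameter the weights take about 12 GB, which leaves some headroom on a 16GB card; at 4 bytes per parameter they would not fit.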

Model Card

New SOTA OCR and Transformers Library 😎

A Transformer-based OCR model for text recognition, released with pre-trained models. It currently builds on the fairseq library, but according to their GitHub repo the authors plan to convert it to Hugging Face.

The TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.

Paper:

https://arxiv.org/pdf/2109.10282.pdf

Papers to Read 📚

https://arxiv.org/pdf/2109.15144.pdf
https://arxiv.org/pdf/2109.12424.pdf

2021 Accelerate State of DevOps report

Google Colab with A100s? Is this Fake News?

Someone got an A100 GPU (40GB VRAM) with a Colab Pro+ Account 🤯

XGBoost for Spam Detection: With Code

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

iFᴀᴄᴇᴛSᴜᴍ

iFACETSUM is a web app that integrates interactive summarization with faceted search, providing a novel faceted navigation scheme that yields abstractive summaries for the user’s selections.

Connected Papers 📈

RAFT: A Real-World Few-Shot Text Classification Benchmark

RAFT is a few-shot classification benchmark that tests language models:

- across multiple domains (lit reviews, medical data, tweets, customer interaction, etc.)

- on economically valuable classification tasks (someone inherently cares about the task)

- with evaluation that mirrors deployment (50 labeled examples per task, info retrieval allowed, hidden test set)

Connected Papers 📈


Subscribe to the NLP Cypher newsletter for the latest in NLP & ML code/research. 🤟