NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 10.03.21

Glomar

Ricky Costa

5 min read · Oct 3, 2021

Hey, welcome back. Loads of NLP research and code came in this week. But first… is your location actually hidden? 💽

GitHub — z0ccc/LocateJS: Check if your location is actually hidden.

Check it out here: https://z0ccc.github.io/LocateJS/. LocateJS predicts your location by analyzing your connection and…

github.com

How to Train Really Large Models on Many GPUs?

How to train large and deep neural networks is challenging, as it demands a large amount of GPU memory and a long…

lilianweng.github.io

ToC:

- Training Parallelism
- Mixture-of-Experts (MoE)
- Other Memory Saving Designs

If you need to know more about the innards of GPUs 👇

Gentle introduction to GPUs inner workings

This article summarizes some lower level aspect of how GPU executes. Although GPU programming is not that complicated…

vksegfault.github.io

PLATO-XL: World’s First 11 Billion Parameter Pre-Trained Dialogue Generation Model?

Blog: http://research.baidu.com/Blog/index-view?id=163

Paper: https://arxiv.org/pdf/2109.09519.pdf

For the Python GUI Inclined…

“PySimpleGUI is a Python package that enables Python programmers of all levels to create GUIs.”

GitHub - PySimpleGUI/PySimpleGUI: Launched in 2018 Actively developed & supported. Supports…

Launched in 2018 Actively developed & supported. Supports tkinter, Qt, WxPython, Remi (in browser). Create custom GUI…

github.com

Repo for Measuring Data Quality

The repo covers:

- Bias and Fairness: Guarantees that data is not biased and its application is fair concerning sensitive attributes for which there are legal and ethical obligations not to differentiate the treatment (e.g., gender, race).
- Data Expectations: Unit tests for data that assert a particular property. Leverage Great Expectations validations, integrate their outcomes in our framework and check the quality of the validations.
- Data Relations: Checks the association between features, tests for causality effects, estimates the feature importance, and detects features with high collinearity.
- Drift Analysis: Often, with time, different patterns may evolve from the data. Using this module, you can check the stability of your features (i.e., covariates) and target (i.e., label) as you look at different chunks of data.
- Duplicates: Data may come from different sources and is not always unique. This module checks for repeated entries in data that are redundant and can (or should) be dropped.
- Labellings: With specialized engines for categorical and numerical targets, this module provides a test suite that detects both common (e.g., imbalanced labels) and complex (e.g., label outliers) issues.
- Missings: Missing values can cause multiple problems in data applications. With this module, you can better understand the severity of their impact and how they occur.
- Erroneous Data: Data may contain values without inherent meaning. With this module, you can inspect your data (tabular and time-series) for typical misguided values on data points.
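To make two of the checks above concrete (Duplicates and Missings), here is a minimal standard-library sketch. It is not the YData Quality API; the function names `duplicate_rows` and `missing_share` and the toy rows are made up purely for illustration.

```python
from collections import Counter


def duplicate_rows(rows):
    """Return each row that occurs more than once (exact match), listed once."""
    # Rows are dicts; turn each into a hashable, order-independent key.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [dict(key) for key, n in counts.items() if n > 1]


def missing_share(rows, columns):
    """Fraction of rows in which each column is None or absent."""
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


# Toy data (hypothetical): one repeated row and one missing label.
rows = [
    {"id": 1, "label": "spam"},
    {"id": 2, "label": None},
    {"id": 1, "label": "spam"},
    {"id": 3, "label": "ham"},
]

dups = duplicate_rows(rows)
missing = missing_share(rows, ["id", "label"])
print("duplicates:", dups)
print("missing share:", missing)
```

A real pipeline would likely run checks like these on data frames and report severity per column, which is the kind of thing the package's modules are described as doing.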

How Can I Measure Data Quality?

Introducing YData Quality: An open-source package for comprehensive Data Quality.

towardsdatascience.com

FastAPI Deployment Docs Update

K8s included.

Colab of the Week

Wanna play with GPT-J 6B? NovelAI finetuned EleutherAI’s GPT-J 6B on 4GB of Python code for code generation. With FP16 weights, it fits in 16GB of GPU VRAM.
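A quick back-of-the-envelope check of why FP16 matters here. This sketch assumes a nominal 6 billion parameters and counts the weights only (no activations, caches, or framework overhead); the helper name `weight_memory_gib` is ours, not from any library.

```python
GIB = 2**30  # bytes in one gibibyte


def weight_memory_gib(n_params, bytes_per_param):
    """Approximate memory needed just to hold the model weights."""
    return n_params * bytes_per_param / GIB


N_PARAMS = 6_000_000_000  # assumption: nominal "6B" parameter count

fp32_gib = weight_memory_gib(N_PARAMS, 4)  # 4 bytes per FP32 weight
fp16_gib = weight_memory_gib(N_PARAMS, 2)  # 2 bytes per FP16 weight

print(f"FP32 weights: {fp32_gib:.1f} GiB")
print(f"FP16 weights: {fp16_gib:.1f} GiB")
```

Under these assumptions the FP32 weights alone would overflow a 16GB card, while the FP16 weights leave a few gigabytes of headroom for everything else.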

Google Colaboratory

colab.research.google.com

Model Card

NovelAI/genji-python-6B · Hugging Face

For example usage or to easily use the model you can check our colab notebook: Notebook Genji is a transformer model…

huggingface.co

New SOTA OCR and Transformers Library 😎

TrOCR is a Transformer-based OCR model for text recognition that ships with pre-trained models. It currently builds on the fairseq library, but according to their GitHub repo the authors plan to convert it to Hugging Face.

The TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.

unilm/trocr at master · microsoft/unilm

TrOCR is an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, which…

github.com

Paper: https://arxiv.org/pdf/2109.10282.pdf

Papers to Read 📚

https://arxiv.org/pdf/2109.15144.pdf

https://arxiv.org/pdf/2109.12424.pdf

2021 Accelerate State of DevOps report

Announcing DORA 2021 Accelerate State of DevOps report | Google Cloud Blog

Over the past seven years, more than 32,000 professionals worldwide have taken part in the Accelerate State of DevOps…

cloud.google.com

Google Colab with A100s? Is this Fake News?

Someone got an A100 GPU (40GB VRAM) with a Colab Pro+ Account 🤯

XGBoost for Spam Detection: With Code

Building Your First NLP Application to Detect SPAM

Despite the complexity of human language, NLP teaches us techniques to break language down semantically and…

blog.paperspace.com

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding

The repo contains implementations of several state-of-the-art methods, along with data processing, a standard training procedure, and an evaluation framework for few-shot NLU.

GitHub - THUDM/FewNLU

Few-shot natural language understanding has attracted much recent attention. However, prior methods have been evaluated…

github.com

Connected Papers 📈

MFAQ: a Multilingual FAQ Dataset

MFAQ is a multilingual corpus of Frequently Asked Questions parsed from the Common Crawl.

GitHub - clips/mfaq: MFAQ: a Multilingual FAQ Dataset

MFAQ is a multilingual FAQ retrieval dataset. We also release a multilingual FAQ retrieval model trained on this…

github.com

Connected Papers 📈

iFᴀᴄᴇᴛSᴜᴍ

iFACETSUM is a web app that integrates interactive summarization with faceted search, providing a novel faceted navigation scheme that yields abstractive summaries for the user’s selections.

GitHub - BIU-NLP/iFACETSUM: Corpus exploration platform using advanced tools such as interactive…

iFᴀᴄᴇᴛSᴜᴍ is an interactive faceted summarization approach and system for navigating within a large document-set on a…

github.com

Connected Papers 📈

RAFT: A Real-World Few-Shot Text Classification Benchmark

RAFT is a few-shot classification benchmark that tests language models:
- across multiple domains (lit reviews, medical data, tweets, customer interaction, etc.)
- on economically valuable classification tasks (someone inherently cares about the task)
- with evaluation that mirrors deployment (50 labeled examples per task, info retrieval allowed, hidden test set)
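To give a feel for the few-shot setup described above, here is a minimal sketch of turning a handful of labeled examples into a classification prompt for a language model. The prompt layout, the function name `build_few_shot_prompt`, and the example data are assumptions for illustration, not the format used by the RAFT baselines.

```python
def build_few_shot_prompt(instructions, examples, query, max_examples=50):
    """Assemble a text prompt from (text, label) pairs plus one unlabeled query."""
    lines = [instructions.strip(), ""]
    # Cap the number of demonstrations, mirroring a fixed labeled budget per task.
    for text, label in examples[:max_examples]:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The model is expected to continue after the final "Label:".
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)


# Hypothetical labeled examples for a customer-interaction task.
examples = [
    ("My card was charged twice.", "complaint"),
    ("Thanks, the issue is resolved!", "not complaint"),
]

prompt = build_few_shot_prompt(
    "Classify each message as 'complaint' or 'not complaint'.",
    examples,
    "I still have not received my refund.",
)
print(prompt)
```

The model's completion after the last `Label:` would then be mapped back to one of the task's classes.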

GitHub - oughtinc/raft-baselines

This is the repository for the GPT-3 baselines described in the RAFT benchmark paper. Set up a virtual environment and…

github.com

Connected Papers 📈

MADE (Multi-Adapter Dataset Experts)

MADE finetunes multiple adapters on different reading comprehension datasets while they all share the same transformer.
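The idea of dataset-specific adapters on top of shared weights can be illustrated with a toy, pure-Python sketch. This is not the MADE implementation: the "shared encoder" is a stand-in linear map, the adapter is a generic bottleneck with a residual connection, and all names and weights are invented for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


class Adapter:
    """Bottleneck adapter: hidden + W_up @ relu(W_down @ hidden)."""

    def __init__(self, w_down, w_up):
        self.w_down = w_down  # projects down to a small bottleneck
        self.w_up = w_up      # projects back up to the hidden size

    def __call__(self, hidden):
        update = matvec(self.w_up, relu(matvec(self.w_down, hidden)))
        return [h + u for h, u in zip(hidden, update)]


def shared_encoder(features):
    """Stand-in for the shared transformer: one fixed linear map."""
    shared_weights = [[1.0, 0.0], [0.0, 1.0]]
    return matvec(shared_weights, features)


# One small adapter per dataset; the encoder weights are shared by all.
adapters = {
    "dataset_a": Adapter(w_down=[[1.0, 1.0]], w_up=[[0.5], [0.0]]),
    "dataset_b": Adapter(w_down=[[1.0, -1.0]], w_up=[[0.0], [2.0]]),
}


def encode(features, dataset):
    """Run the shared encoder, then the adapter for the given dataset."""
    return adapters[dataset](shared_encoder(features))


out_a = encode([1.0, 2.0], "dataset_a")
out_b = encode([1.0, 2.0], "dataset_b")
print(out_a, out_b)
```

The same input yields different representations depending on which adapter is selected, while the bulk of the parameters (here, `shared_weights`) stays common to every dataset.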

GitHub - princeton-nlp/MADE: EMNLP 2021: Single-dataset Experts for Multi-dataset…

This repository contains the implementation of MADE (Multi-adapter dataset experts), which is described in the paper…

github.com

Connected Papers 📈

Quantum Stat

Written by Ricky Costa