NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 06.13.21

Vintage

Ricky Costa

Published in

Geek Culture

7 min read | Jun 13, 2021

Welcome back! EleutherAI open-sourced a brand new (and big) GPT model this past week. The JAX-based model was trained for 5 weeks on the Pile, Eleuther’s own ~800GB dataset. It’s called GPT-J, a 6 billion parameter model that rivals the performance of a similarly sized GPT-3 model. And apparently it performs well on code generation:

Here’s a comparison of all the major language models on various datasets:

EleutherAI has a demo webpage for you to try out the model:

EleutherAI - text generation testing UI

EleutherAI web app testing for language models

6b.eleuther.ai

And a Colab for inference over TPUs 😁:

Google Colaboratory

colab.research.google.com

Want to thank Connected Papers for the shout-out this week! 😎

FYI, after the upcoming NLP Index update, we’ll pass the 6,000 repo mark! 🚀

TextStyleBrush

TextStyleBrush can recognize the style of text in pictures and edit the words while maintaining that style.

It’s “… the first self-supervised AI model that replaces text in images of both handwriting and scenes — in one shot — using a single example word.”

Examples:

AI can now emulate text style in images in one shot - using just a single word

We're introducing TextStyleBrush, an AI research project that can copy the style of text in a photo using just a single…

ai.facebook.com

Getting Started with Tensorflow-Metal PluggableDevice

Install TensorFlow v2.5 and the tensorflow-metal PluggableDevice to accelerate training with Metal on Mac GPUs.

Tensorflow Plugin - Metal - Apple Developer

Find presentations, documentation, sample code, and resources for building macOS, iOS, and tvOS apps with the Metal…

developer.apple.com
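
Once both packages are installed, a quick check along these lines can confirm whether TensorFlow actually sees the Mac GPU. This is a minimal sketch, not taken from Apple's guide: the helper name `visible_gpus` is hypothetical, and it assumes TensorFlow v2.5 plus the tensorflow-metal plugin are installed (it simply returns an empty list if TensorFlow is missing).

```python
def visible_gpus():
    """Return the names of the GPU devices TensorFlow can see (empty if none)."""
    try:
        import tensorflow as tf  # assumes tensorflow and tensorflow-metal are installed
    except ImportError:
        return []  # TensorFlow is not available in this environment
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    print(gpus if gpus else "No GPU visible to TensorFlow")
```

If the list comes back empty on a Mac, the plugin is probably not installed in the active environment.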

Do You Really Need Redis? How to Get Away with Just PostgreSQL

Chris Farber highlights how to use Postgres for common Redis use-cases. In all, he describes three: job queuing, application locks, and pub/sub! Have to say, the pub/sub example was surprising:

Do You Need Redis? PostgreSQL Does Queuing, Locking, & Pub/Sub

There’s a tried-and-true architecture that I’ve seen many times for supporting your web services and applications…

spin.atomicobject.com
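
To give a flavor of the job-queuing case, here is a minimal sketch of one common Postgres pattern: claiming a row with `FOR UPDATE SKIP LOCKED` so that concurrent workers never grab the same job. It is not copied from the article. The `jobs` table, its columns, and the `claim_next_job` helper are hypothetical names chosen for illustration.

```python
# Hypothetical schema: jobs(id, payload, status). Not taken from the article.
CLAIM_JOB_SQL = """
UPDATE jobs
SET status = 'running'
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'queued'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload;
"""


def claim_next_job(cursor):
    """Claim one queued job through any DB-API cursor connected to Postgres.

    Returns the (id, payload) row, or None when no queued job is available.
    """
    cursor.execute(CLAIM_JOB_SQL)
    return cursor.fetchone()
```

With a driver such as psycopg2, the cursor would come from an open connection, and the claim becomes visible to other workers once the transaction commits. Rows locked by another worker are skipped rather than waited on, which is what makes the table behave like a queue.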

Reasoning with Knowledge Graphs (Slides)

Goes over two papers:

Reasoning with Language Models and Knowledge Graphs for Question Answering https://arxiv.org/abs/2104.06378

Multi-hop logical reasoning on KGs https://arxiv.org/abs/2010.11465

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

Extractive Research Slide Generation Using Windowed Labeling Ranking

A method to automatically generate slides for scientific papers based on a corpus of 5000 paper-slide pairs compiled from conference proceedings websites.

atharsefid/Extractive_Research_Slide_Generation_Using_Windowed_Labeling_Ranking

This article is published at the Scholarly Document Processing (SDP) 2021 workshop. Download the original papers and…

github.com

Connected Papers 📈

Cross-Document Coreference Resolution

The first end-to-end model for cross-document (CD) coreference resolution from raw text, which extends the prominent model for within-document coreference to the CD setting.

ariecattan/coref

This repository contains code and models for end-to-end cross-document coreference resolution, as described in our…

github.com

Connected Papers 📈

Stackoverflow Code Generation Using BART

A corpus of over 40,000 StackOverflow question texts to be used in conjunction with their corresponding intents from the CoNaLa dataset.

gabeorlanski/stackoverflow-encourages-cheating

This is the repository for the paper Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive…

github.com

Colab:

Google Colaboratory

colab.research.google.com

Connected Papers 📈

Text-to-SQL in the Wild

A dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website.

hirupert/sede

Code and data from the paper: Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data.

github.com

Connected Papers 📈

Capturing Row and Column Semantics in Transformer Based Question Answering over Tables

IBM/row-column-intersection

This project makes available the code and data from our NAACL paper: "Capturing Row and Column Semantics in Transformer…

github.com

Connected Papers 📈

Framework for Evaluating Open-domain Chatbot Consistency

Addressing Inquiries about History (AIH) contains two stages: (1) During the inquiry stage, questions about the facts and opinions mentioned in the conversation history are inserted into the conversation between chatbots. (2) During the contradiction recognition stage, the responses to the inserted questions are collected, and automatic models or human judges decide whether the responses are consistent with the dialogue history.

ictnlp/AIH

This repository contains the code for Findings of ACL 2021 paper Addressing Inquiries about History: An Efficient and…

github.com

Connected Papers 📈
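
The two AIH stages can be sketched as a small loop. This is an illustration under stated assumptions, not the repo's code: `run_aih`, the `Chatbot` and `Judge` callables, and the toy bot and judge in the example are all hypothetical stand-ins for real dialogue models and contradiction detectors.

```python
from typing import Callable, List, Tuple

Chatbot = Callable[[List[str]], str]      # maps a dialogue history to a reply
Judge = Callable[[List[str], str], bool]  # True if a reply is consistent with the history


def run_aih(
    history: List[str],
    inquiries: List[str],
    chatbot: Chatbot,
    judge: Judge,
) -> List[Tuple[str, str, bool]]:
    """Return (question, answer, is_consistent) for each inserted inquiry."""
    results = []
    for question in inquiries:
        # Stage 1 (inquiry): insert a question about the history and collect the answer.
        answer = chatbot(history + [question])
        # Stage 2 (contradiction recognition): check the answer against the history.
        results.append((question, answer, judge(history, answer)))
    return results


if __name__ == "__main__":
    history = ["A: Do you have pets?", "B: Yes, I have two cats."]
    forgetful_bot = lambda turns: "No, I have no pets."
    # Toy judge: treats any reply that denies having pets as inconsistent.
    toy_judge = lambda turns, reply: "no pets" not in reply
    print(run_aih(history, ["A: How many pets do you have?"], forgetful_bot, toy_judge))
```

In practice the judge would be an automatic model or a human annotator, as the description above notes.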

Few-Shot Intent Detection

Few-shot intent detection with/without Out-of-Scope (OOS) intents.

jianguoz/Few-Shot-Intent-Detection

Few-Shot-Intent-Detection is a repository designed for few-shot intent detection with/without Out-of-Scope (OOS)…

github.com

Connected Papers 📈

SciFive: Text-to-Text Framework for Biomedical Literature

A domain-specific T5 model that has been pre-trained on large biomedical corpora. The model outperforms the current SOTA methods (i.e. BERT, BioBERT, Base T5) on tasks in named entity recognition, relation extraction, natural language inference, and question answering.

justinphan3110/SciFive

SciFive provided a Text-Text framework for biomedical language and natural language in NLP. Under the T5's framework…

github.com

Connected Papers 📈

FastSeq: Sequence Model Library

Provides implementation of sequence models (e.g. Bart, ProphetNet) for text generation, summarization, translation tasks etc. It automatically optimizes inference speed based on popular NLP toolkits (e.g. FairSeq and HuggingFace-Transformers) without accuracy loss.

microsoft/fastseq

FastSeq provides efficient implementation of popular sequence models (e.g. Bart, ProphetNet) for text…

github.com

Connected Papers 📈

Python Programming Puzzles

A dataset of Python programming puzzles which can be used to teach and evaluate an AI’s programming proficiency.

microsoft/PythonProgrammingPuzzles

This repo contains a dataset of python programming puzzles which can be used to teach and evaluate an AI's programming…

github.com

Connected Papers 📈
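
To make the idea concrete, here is a minimal sketch that assumes a puzzle is expressed as a Python function returning True when given a valid answer. The puzzle below is made up for illustration and is not from the dataset. `solve_by_search` is a hypothetical brute-force baseline, not the repo's solver.

```python
def puzzle(x: int) -> bool:
    """Illustrative puzzle (not from the dataset): find x whose square is 1369."""
    return x * x == 1369


def solve_by_search(check, candidates):
    """Return the first candidate that satisfies the puzzle, or None."""
    for candidate in candidates:
        if check(candidate):
            return candidate
    return None


answer = solve_by_search(puzzle, range(1000))
print(answer, puzzle(answer))  # 37 True
```

The appeal of this format is that a proposed answer can be verified just by running the function, so no reference solution is needed to grade a model.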

XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks

XtremeDistilTransformers comes with Tensorflow 2.3 and HuggingFace Transformers with a unified API with the following features:

Distil any supported pre-trained language models as teachers (e.g., Bert, Electra, Roberta)
Initialize student model with any pre-trained model (e.g., MiniLM, DistilBert, TinyBert), or initialize from scratch
Multilingual text classification and sequence tagging
Distil multiple hidden states from teacher
Distil deep attention networks from teacher
Pairwise and instance-level classification tasks (e.g., MNLI, MRPC, SST)
Progressive knowledge transfer with gradual unfreezing
Fast mixed precision training for distillation (e.g., mixed_float16, mixed_bfloat16)
ONNX runtime inference

microsoft/xtreme-distil-transformers

Releasing XtremeDistilTransformers with Tensorflow 2.3 and HuggingFace Transformers with a unified API with the…

github.com

Connected Papers 📈

Swords ⚔️: Stanford Word Substitution Benchmark

A new benchmark for lexical substitution, the task of finding appropriate substitutes for a target word in a context.

p-lambda/swords

This repository houses the Stanford Word Substitution (Swords) benchmark. Swords ⚔️ is a benchmark for the task of…

github.com

Connected Papers 📈

Fact Extraction and VERification Over Unstructured and Structured information

The dataset consists of 87,026 verified claims. Each claim is annotated with evidence in the form of sentences and/or cells from tables in Wikipedia, as well as a label indicating whether this evidence supports, refutes, or does not provide enough information to reach a verdict.

Raldir/FEVEROUS

This repository maintains the code to generate and prepare the dataset, as well as the code of the annotation platform…

github.com

Connected Papers 📈

AUGNLG: Few-shot Natural Language Generation using Self-trained Data Augmentation

A data augmentation approach that combines a self-trained neural retrieval model with a few-shot learned NLU model, to automatically create MR-to-Text data from open-domain texts.

XinnuoXu/AugNLG

Code for paper " Xinnuo Xu, Guoyin Wang, Young-Bum Kim, Sungjin Lee AUGNLG: Few-shot Natural Language Generation using…

github.com

Connected Papers 📈

Dataset of the Week: FLORES

What is it?

An evaluation benchmark for low-resource and multilingual machine translation. It’s a many-to-many multilingual translation benchmark dataset consisting of 3,001 sentences extracted from English Wikipedia and covering a variety of different topics and domains for 101 languages.
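
To see what "many-to-many" buys you, here is a quick back-of-the-envelope count of translation directions. It is a sketch that assumes every one of the 3,001 sentences is available in all 101 languages, so any language can serve as source or target. The three language codes in the toy example are placeholders.

```python
from itertools import permutations

NUM_LANGUAGES = 101            # from the FLORES-101 description
SENTENCES_PER_LANGUAGE = 3001  # from the FLORES-101 description

# Each ordered (source, target) pair of distinct languages is one direction.
directions = NUM_LANGUAGES * (NUM_LANGUAGES - 1)
print(directions)                           # 10100
print(directions * SENTENCES_PER_LANGUAGE)  # 30310100 sentence pairs in total

# Same counting on a toy set of three languages gives 3 * 2 = 6 directions.
toy_directions = list(permutations(["en", "fr", "sw"], 2))
print(len(toy_directions))                  # 6
```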

Where is it?

facebookresearch/flores

FLORES-101 is a Many-to-Many multilingual translation benchmark dataset for 101 languages. Looking for FLORESv1, which…

github.com

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat

Quantum Stat

Written by Ricky Costa