The Voyage of Life: Youth | Cole

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 06.13.21

Vintage

Published in Geek Culture · 7 min read · Jun 13, 2021


Welcome back! EleutherAI open-sourced a brand new (and big) GPT model this past week. The JAX-based model was trained for 5 weeks on the Pile, Eleuther’s own ~800GB dataset. It’s called GPT-J, a 6-billion-parameter model that rivals the performance of a similarly sized GPT-3 model. And apparently it performs well on code generation.

Here’s a comparison of all the major language models on various datasets:

EleutherAI has a demo webpage for you to try out the model:

And a Colab for inference over TPUs 😁:

Want to thank Connected Papers for the shout-out this week! 😎

FYI, after the upcoming NLP Index update, we’ll pass the 6,000 repo mark! 🚀

TextStyleBrush

TextStyleBrush can recognize the style of text in pictures and edit the words while preserving that style.

It’s “… the first self-supervised AI model that replaces text in images of both handwriting and scenes — in one shot — using a single example word.”


Getting Started with Tensorflow-Metal PluggableDevice

Install TensorFlow v2.5 and the tensorflow-metal PluggableDevice to accelerate training with Metal on Mac GPUs.

Do You Really Need Redis? How to Get Away with Just PostgreSQL

Chris Farber shows how to use Postgres for common Redis use cases. He describes three: job queuing, application locks, and pub/sub. Have to say, the pub/sub example was surprising.
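For a rough idea of what those three use cases can look like, here is a minimal sketch that pairs each one with the Postgres feature commonly used for it: `FOR UPDATE SKIP LOCKED` for job queues, advisory locks for application locks, and `LISTEN`/`NOTIFY` for pub/sub. The `jobs` table, its columns, the `job_events` channel, and the helper functions are hypothetical and not taken from the article; the helpers assume a DB-API cursor such as the one psycopg2 provides.

```python
# Hypothetical sketch: table, column, and channel names are illustrative only.

# 1. Job queue: claim one pending job. SKIP LOCKED lets concurrent workers
#    skip rows that another transaction has already locked.
CLAIM_JOB_SQL = """
UPDATE jobs
SET status = 'running'
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload
"""

# 2. Application lock: session-level advisory lock keyed by an integer.
TRY_LOCK_SQL = "SELECT pg_try_advisory_lock(%s)"
UNLOCK_SQL = "SELECT pg_advisory_unlock(%s)"

# 3. Pub/sub: publishers call pg_notify(); subscribers run LISTEN on the channel.
NOTIFY_SQL = "SELECT pg_notify(%s, %s)"
LISTEN_SQL = "LISTEN job_events"


def claim_job(cur):
    """Claim one pending job; returns (id, payload), or None if nothing is pending."""
    cur.execute(CLAIM_JOB_SQL)
    return cur.fetchone()


def try_lock(cur, key):
    """Try to take the advisory lock `key` without blocking; True on success."""
    cur.execute(TRY_LOCK_SQL, (key,))
    return bool(cur.fetchone()[0])


def unlock(cur, key):
    """Release an advisory lock previously taken in this session."""
    cur.execute(UNLOCK_SQL, (key,))


def publish(cur, channel, message):
    """Send `message` to every session currently listening on `channel`."""
    cur.execute(NOTIFY_SQL, (channel, message))
```

A worker would call `claim_job` inside a transaction and commit once the job is done; a subscriber would execute `LISTEN_SQL` on its own connection and poll that connection for notifications.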

Reasoning with Knowledge Graphs (Slides)

Goes over two papers:

Reasoning with Language Models and Knowledge Graphs for Question Answering https://arxiv.org/abs/2104.06378

Multi-hop logical reasoning on KGs https://arxiv.org/abs/2010.11465

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

Cross-Document Coreference Resolution

The first end-to-end model for cross-document (CD) coreference resolution from raw text, which extends the prominent model for within-document coreference to the CD setting.

Connected Papers 📈

Text-to-SQL in the Wild

A dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website.

Connected Papers 📈

Framework for Evaluating Open-domain Chatbot Consistency

Addressing Inquiries about History (AIH) has two stages: (1) in the inquiry stage, questions about the facts and opinions mentioned in the conversation history are inserted into the conversation between chatbots; (2) in the contradiction recognition stage, the responses to the inserted questions are collected, and automatic models or human judges decide whether the responses are consistent with the dialogue history.
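The two stages can be sketched in a few lines. This is a toy illustration of the idea, not the authors’ implementation: the chatbot and the judge are plain callables, and `run_aih`, `Bot`, and `Judge` are hypothetical names.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces, for illustration only.
Bot = Callable[[List[str]], str]               # dialogue history -> next utterance
Judge = Callable[[List[str], str, str], bool]  # (history, question, answer) -> consistent?


def run_aih(bot: Bot, history: List[str], inquiries: List[str], judge: Judge) -> float:
    """Return the fraction of inquiry responses judged to contradict the history."""
    # Stage 1 (inquiry): insert each question about the history and collect the answer.
    responses: List[Tuple[str, str]] = []
    for question in inquiries:
        answer = bot(history + [question])
        responses.append((question, answer))

    # Stage 2 (contradiction recognition): check each answer against the history.
    contradictions = sum(
        1 for question, answer in responses if not judge(history, question, answer)
    )
    return contradictions / len(responses) if responses else 0.0
```

In practice the judge could be an NLI-style classifier or a human annotator; here it is just a function that returns `True` when the answer is consistent with the history.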

Connected Papers 📈

SciFive: Text-to-Text Framework for Biomedical Literature

A domain-specific T5 model pre-trained on large biomedical corpora. The model outperforms the current SOTA methods (i.e., BERT, BioBERT, Base T5) on named entity recognition, relation extraction, natural language inference, and question answering tasks.

Connected Papers 📈

FastSeq: Sequence Model Library

Provides implementations of sequence models (e.g., BART, ProphetNet) for text generation, summarization, and translation tasks. It automatically optimizes inference speed for popular NLP toolkits (e.g., FairSeq and HuggingFace Transformers) without accuracy loss.

Connected Papers 📈

XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks

XtremeDistilTransformers comes with TensorFlow 2.3 and HuggingFace Transformers support and a unified API with the following features:

  • Distil any supported pre-trained language model as a teacher (e.g., BERT, ELECTRA, RoBERTa)
  • Initialize the student model with any pre-trained model (e.g., MiniLM, DistilBERT, TinyBERT), or initialize from scratch
  • Multilingual text classification and sequence tagging
  • Distil multiple hidden states from teacher
  • Distil deep attention networks from teacher
  • Pairwise and instance-level classification tasks (e.g., MNLI, MRPC, SST)
  • Progressive knowledge transfer with gradual unfreezing
  • Fast mixed precision training for distillation (e.g., mixed_float16, mixed_bfloat16)
  • ONNX runtime inference
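To make the distillation idea behind these features concrete, here is a minimal pure-Python sketch of two generic distillation losses: a mean-squared-error term that pulls a student hidden state toward the teacher’s, and a temperature-softened KL term on output logits. It is a textbook-style illustration that assumes student and teacher vectors have the same length, not XtremeDistilTransformers’ actual code; all function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by `temperature` (higher = softer distribution)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hidden_state_loss(student: List[float], teacher: List[float]) -> float:
    """Mean squared error between a student and a teacher hidden state."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)


def soft_label_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) between temperature-softened output distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

A training loop would typically minimize a weighted sum of such terms; the weights and the choice of which layers to match are design decisions this sketch leaves open.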

Connected Papers 📈

Fact Extraction and VERification Over Unstructured and Structured information

The dataset consists of 87,026 verified claims. Each claim is annotated with evidence in the form of sentences and/or cells from tables in Wikipedia, as well as a label indicating whether this evidence supports, refutes, or does not provide enough information to reach a verdict.

Connected Papers 📈

AUGNLG: Few-shot Natural Language Generation using Self-trained Data Augmentation

A data augmentation approach that combines a self-trained neural retrieval model with a few-shot learned NLU model to automatically create MR-to-Text data from open-domain texts.

Connected Papers 📈

Dataset of the Week: FLORES

What is it?

An evaluation benchmark for low-resource and multilingual machine translation. It’s a many-to-many multilingual translation benchmark dataset consisting of 3,001 sentences extracted from English Wikipedia and covering a variety of different topics and domains for 101 languages.

Where is it?

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

Quantum Stat


Ricky Costa
Geek Culture

Subscribe to the NLP Cypher newsletter for the latest in NLP & ML code/research. 🤟