NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 06.13.21
Vintage
Welcome back! EleutherAI open-sourced a brand new (and big) GPT model this past week. The JAX-based model was trained for 5 weeks on the Pile, Eleuther’s own ~800GB data dump. It's called GPT-J, a 6 billion parameter model that rivals the performance of a similarly sized GPT-3 model. And apparently it performs well on code generation.
Here’s a comparison of all the major language models on various datasets:
EleutherAI has a demo webpage for you to try out the model:
And a Colab for inference over TPUs 😁:
Want to thank Connected Papers for the shout-out this week! 😎
FYI, after the upcoming NLP Index update, we’ll pass the 6,000 repo mark! 🚀
TextStyleBrush
TextStyleBrush can recognize the style of text in pictures and edit the words while maintaining that style.
It’s “… the first self-supervised AI model that replaces text in images of both handwriting and scenes — in one shot — using a single example word.”
Getting Started with Tensorflow-Metal PluggableDevice
Install TensorFlow v2.5 and the tensorflow-metal PluggableDevice to accelerate training with Metal on Mac GPUs.
Do You Really Need Redis? How to Get Away with Just PostgreSQL
Chris Farber shows how to use Postgres for common Redis use cases. He describes three of them: job queuing, application locks, and pub/sub. Have to say, the pub/sub example was surprising.
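For a rough idea of how these three use cases can map onto Postgres, here is a minimal sketch. It assumes the standard Postgres features commonly used for them: `FOR UPDATE SKIP LOCKED` for job queues, advisory locks for application locks, and `LISTEN`/`NOTIFY` for pub/sub. The `jobs` table, its columns, and the helper names are hypothetical, and `conn` is assumed to be a DB-API connection using the `%s` parameter style (as psycopg2 does). This is not the code from the article.

```python
"""Sketch: Redis-style patterns on plain Postgres (hypothetical schema)."""

# Claim one pending job. SKIP LOCKED makes concurrent workers skip rows
# that another worker has already locked instead of waiting on them.
CLAIM_JOB_SQL = """
UPDATE jobs SET status = 'running'
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload
"""


def claim_job(conn):
    """Job queue: return (id, payload) of one claimed job, or None if empty."""
    cur = conn.cursor()
    cur.execute(CLAIM_JOB_SQL)
    row = cur.fetchone()
    conn.commit()
    return row


def try_app_lock(conn, key):
    """Application lock: try to take a session-level advisory lock.

    `key` is an integer chosen by the application. Returns True if acquired.
    """
    cur = conn.cursor()
    cur.execute("SELECT pg_try_advisory_lock(%s)", (key,))
    return bool(cur.fetchone()[0])


def publish(conn, channel, message):
    """Pub/sub: notify every session that is listening on `channel`."""
    cur = conn.cursor()
    cur.execute("SELECT pg_notify(%s, %s)", (channel, message))
    conn.commit()


def subscribe(conn, channel):
    """Pub/sub: start listening on `channel` for this session."""
    # LISTEN takes an identifier, not a bind parameter, so validate it first.
    if not channel.isidentifier():
        raise ValueError(f"unsafe channel name: {channel!r}")
    cur = conn.cursor()
    cur.execute(f'LISTEN "{channel}"')
    conn.commit()
```

With psycopg2, a subscriber would typically call `conn.poll()` and then read incoming messages from `conn.notifies`. Other drivers expose notifications differently, so treat that part as driver-specific.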
Reasoning with Knowledge Graphs (Slides)
Goes over two papers:
Reasoning with Language Models and Knowledge Graphs for Question Answering https://arxiv.org/abs/2104.06378
Multi-hop logical reasoning on KGs https://arxiv.org/abs/2010.11465
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
Extractive Research Slide Generation Using Windowed Labeling Ranking
A method to automatically generate slides for scientific papers based on a corpus of 5,000 paper-slide pairs compiled from conference proceedings websites.
Cross-Document Coreference Resolution
The first end-to-end model for cross-document (CD) coreference resolution from raw text, which extends the prominent model for within-document coreference to the CD setting.
Stackoverflow Code Generation Using BART
A corpus of over 40,000 StackOverflow question texts to be used in conjunction with their corresponding intents from the CoNaLa dataset. A Colab is also available.
Text-to-SQL in the Wild
A dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website.
Framework for Evaluating Open-domain Chatbot Consistency
Addressing Inquiries about History (AIH) has two stages. (1) In the inquiry stage, questions about the facts and opinions mentioned in the conversation history are inserted into the conversation between chatbots. (2) In the contradiction recognition stage, the responses to the inserted questions are collected, and automatic models or human judges decide whether the responses are consistent with the dialogue history.
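The two stages can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the chatbots are modeled as plain callables from dialogue history to a reply, and `make_inquiry` and `is_consistent` are hypothetical stand-ins for the question generator and the contradiction recognizer (an automatic model or a human judge).

```python
from typing import Callable, List, Tuple

Bot = Callable[[List[str]], str]           # dialogue history -> next utterance
Probe = Tuple[List[str], str, str]         # (history snapshot, question, answer)


def inquiry_stage(bot_a: Bot, bot_b: Bot,
                  make_inquiry: Callable[[List[str]], str],
                  opening: str, turns: int) -> List[Probe]:
    """Let two bots talk and question bot_b about the history after each reply."""
    history = [opening]
    probes: List[Probe] = []
    for _ in range(turns):
        history.append(bot_b(history))
        question = make_inquiry(history)
        # Assumption: the inquiry is asked on a copy of the history, so the
        # probe and its answer do not leak into the main conversation.
        answer = bot_b(history + [question])
        probes.append((list(history), question, answer))
        history.append(bot_a(history))
    return probes


def contradiction_stage(probes: List[Probe],
                        is_consistent: Callable[[List[str], str, str], bool]) -> float:
    """Return the fraction of probe answers judged inconsistent with the history."""
    if not probes:
        return 0.0
    contradictions = sum(
        1 for history, question, answer in probes
        if not is_consistent(history, question, answer)
    )
    return contradictions / len(probes)
```

A lower contradiction rate would indicate a more consistent chatbot. How the inquiries are generated and how the judge is built are where the real work lies, and both are left abstract here.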
Few-Shot Intent Detection
Few-shot intent detection with/without Out-of-Scope (OOS) intents.
SciFive: Text-to-Text Framework for Biomedical Literature
A domain-specific T5 model that has been pre-trained on large biomedical corpora. The model outperforms the current SOTA methods (i.e. BERT, BioBERT, Base T5) on tasks in named entity recognition, relation extraction, natural language inference, and question answering.
FastSeq: Sequence Model Library
Provides implementations of sequence models (e.g. Bart, ProphetNet) for text generation, summarization, and translation tasks. It automatically optimizes inference speed on top of popular NLP toolkits (e.g. FairSeq and HuggingFace-Transformers) without accuracy loss.
Python Programming Puzzles
A dataset of Python programming puzzles which can be used to teach and evaluate an AI’s programming proficiency.
XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks
XtremeDistilTransformers comes with TensorFlow 2.3 and HuggingFace Transformers, with a unified API and the following features:
- Distil any supported pre-trained language models as teachers (e.g., Bert, Electra, Roberta)
- Initialize student model with any pre-trained model (e.g., MiniLM, DistilBert, TinyBert), or initialize from scratch
- Multilingual text classification and sequence tagging
- Distil multiple hidden states from teacher
- Distil deep attention networks from teacher
- Pairwise and instance-level classification tasks (e.g., MNLI, MRPC, SST)
- Progressive knowledge transfer with gradual unfreezing
- Fast mixed precision training for distillation (e.g., mixed_float16, mixed_bfloat16)
- ONNX runtime inference
Swords ⚔️: Stanford Word Substitution Benchmark
A new benchmark for lexical substitution, the task of finding appropriate substitutes for a target word in a context.
Fact Extraction and VERification Over Unstructured and Structured information
Dataset consists of 87,026 verified claims. Each claim is annotated with evidence in the form of sentences and/or cells from tables in Wikipedia, as well as a label indicating whether this evidence supports, refutes, or does not provide enough information to reach a verdict.
AUGNLG: Few-shot Natural Language Generation using Self-trained Data Augmentation
A data augmentation approach that combines a self-trained neural retrieval model with a few-shot learned NLU model, to automatically create MR-to-Text data from open-domain texts.
Dataset of the Week: FLORES
What is it?
An evaluation benchmark for low-resource and multilingual machine translation. It’s a many-to-many multilingual translation benchmark dataset consisting of 3,001 sentences extracted from English Wikipedia and covering a variety of different topics and domains for 101 languages.
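To make "many-to-many" concrete: with 101 languages, every ordered pair of distinct languages is a translation direction. The small sketch below counts them. The language codes are placeholders rather than the benchmark's actual identifiers, and it assumes every one of the 3,001 sentences is available in all 101 languages.

```python
from itertools import permutations

NUM_LANGUAGES = 101
SENTENCES_PER_LANGUAGE = 3001

# Placeholder codes (hypothetical); the real benchmark uses its own identifiers.
languages = [f"lang_{i:03d}" for i in range(NUM_LANGUAGES)]

# Every ordered (source, target) pair of distinct languages is one direction.
directions = list(permutations(languages, 2))

print(len(directions))                           # 10100
print(len(directions) * SENTENCES_PER_LANGUAGE)  # 30310100
```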
Where is it?
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat