The Old Bridge | Robert

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 09.19.21

Vintage Vectors

7 min read · Sep 19, 2021

Welcome back! We have a long newsletter this week, as many new NLP repos were published now that tech nerds are returning from their summer vacations. 😁

This week I’ll add close to 150 new NLP repos to the NLP Index, so stay tuned for the update.

Welcome to the Matrix

Six Degrees of Wikipedia

just explore…

EmbeddingHub

Embeddinghub is a database built for machine learning embeddings. It is built with four goals in mind.

  • Store embeddings durably and with high availability
  • Allow for approximate nearest neighbor operations
  • Enable other operations like partitioning, sub-indices, and averaging
  • Manage versioning, access control, and rollbacks painlessly
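
To make the nearest-neighbor and averaging goals concrete, here is a minimal sketch in plain Python. It is not Embeddinghub's API: `cosine`, `nearest`, and `average` are hypothetical helper names, the vectors are toy data, and the search is exact brute force rather than the approximate index a real embedding store would likely use.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(store, query, k=1):
    """Return the k keys most similar to query (exact brute-force search)."""
    ranked = sorted(store, key=lambda key: cosine(store[key], query), reverse=True)
    return ranked[:k]

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Toy embeddings, made up for illustration only.
store = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

# Average two embeddings, then look up the neighbors of the result.
pets = average([store["cat"], store["dog"]])
print(nearest(store, pets, k=2))
```

Brute force scans every stored vector, which is fine for a handful of items but is exactly the cost that approximate nearest neighbor indexes are designed to avoid at scale.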

Rubrix | Open Sourced NLP Data Explorer/Annotator

Rubrix is compatible with the usual suspects in NLP: Hugging Face Transformers, spaCy, Stanford Stanza, Flair, etc.

Rubrix can:

  • Monitor the predictions of deployed models.
  • Collect ground-truth data for starting up a project or evolving an existing one.
  • Iterate on ground-truth data and predictions to debug, track and improve your models over time.
  • Build custom applications and dashboards on top of your model predictions and ground-truth data.

AI100 Survey

After 5 years, the survey is back.

PDF

https://ai100.stanford.edu/sites/g/files/sbiybj18871/files/media/file/AI100Report_MT_10.pdf

Beyond “Vanilla” Question Answering

Deepset blog on how to enhance a QA model by adding more features such as classification, summarization, and generative QA.

Papers to Read 📚

https://arxiv.org/pdf/2102.01192.pdf
https://arxiv.org/pdf/2109.04422.pdf
https://assets.amazon.science/46/ea/020baefd4019bd7095417e02e350/voiser-a-new-benchmark-for-voice-based-search-refinement.pdf

Mistakes Made in AWS

Learning from failure is more informative than learning from success.

New Models for Sentence Transformers

Comparing Language Identification Libraries

Get a thorough rundown of the leading language identification libraries, with a comparison of accuracy, language coverage, speed, and memory consumption.
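
If you want to run a similar comparison yourself, a small harness along these lines may help. It is only a sketch: `toy_detector` is a hypothetical stand-in for a real language identification library, `benchmark` is an illustrative helper name, and the four samples are made up. It covers accuracy and speed only, not language coverage or memory consumption.

```python
import time

def toy_detector(text):
    """Hypothetical stand-in for a real language identification library."""
    spanish_markers = {"el", "la", "los", "es", "una", "que"}
    words = set(text.lower().split())
    return "es" if words & spanish_markers else "en"

def benchmark(detect, samples):
    """Return (accuracy, seconds per sample) for a detect(text) -> code function."""
    start = time.perf_counter()
    predictions = [detect(text) for text, _ in samples]
    elapsed = time.perf_counter() - start
    correct = sum(pred == gold for pred, (_, gold) in zip(predictions, samples))
    return correct / len(samples), elapsed / len(samples)

# Tiny labeled sample set, made up for illustration.
samples = [
    ("the weather is nice today", "en"),
    ("la casa es grande", "es"),
    ("where is the station", "en"),
    ("los libros que tengo", "es"),
]

accuracy, seconds_per_sample = benchmark(toy_detector, samples)
print(f"accuracy={accuracy:.2f}, seconds per sample={seconds_per_sample:.6f}")
```

Swapping `toy_detector` for a wrapper around each real library keeps the measurement identical across candidates, which is the point of a comparison like this.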

AWESOME NOTEBOOKS

A very handy collection of notebooks for everyday data engineering tasks.

CodeT5 from Salesforce on Hugging Face Model Hub

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

Macaw | Multi-Angle C(Q)uestion Answering

A model capable of general question answering, showing robustness outside the domains it was trained on. It has been trained in a “multi-angle” fashion, which means it can handle a flexible set of input and output “slots” (like question, answer, explanation). Built on top of T5.

Connected Papers 📈

Generating Out-of-scope Labels with Data augmentation (GOLD)

A technique that augments existing data to train better out-of-scope detectors operating in low-data regimes. GOLD generates pseudo-labeled candidates using samples from an auxiliary dataset and keeps only the most beneficial candidates for training through a novel filtering mechanism.

Connected Papers 📈

AliceMind: ALIbaba’s Collection of Encoder-Decoders from MinD Lab

The repo contains the AliceMind family of models:

  • Language understanding model: StructBERT (ICLR 2020)
  • Generative language model: PALM (EMNLP 2020)
  • Cross-lingual language model: VECO (ACL 2021)
  • Cross-modal language model: StructVBERT (CVPR 2020 VQA Challenge Runner-up)
  • Structural language model: StructuralLM (ACL 2021)
  • Chinese language understanding model with multi-granularity inputs: LatticeBERT (NAACL 2021)

Connected Papers 📈

SEW (Squeezed and Efficient Wav2vec)

A repo built around the wav2vec 2.0 model that formalizes several architecture design choices influencing both model performance and efficiency.

Connected Papers 📈

Zero-Shot Dialogue State Tracking via Cross-Task Transfer

TransferQA is a transferable generative QA model that seamlessly combines extractive QA and multi-choice QA via a text-to-text transformer framework, and tracks both categorical and non-categorical slots in dialogue state tracking.

Connected Papers 📈

BenchIE: Benchmark for Open Information Extraction

BenchIE is a benchmark for measuring the performance of Open Information Extraction (OIE) systems. Given manual annotations and a set of extractions from different OIE systems, BenchIE measures precision, recall, and F1 score using the authors' fact-based approach to evaluating OIE systems.
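
For intuition, here is a rough sketch of how fact-based scoring can work. It is not BenchIE's code, and it rests on assumptions: each gold fact is modeled as a set of acceptable triples, an extraction counts as correct if it matches any of them, and a fact counts as covered if at least one extraction matches it. The function name `fact_based_scores` and the toy data are hypothetical.

```python
def fact_based_scores(extractions, gold_facts):
    """
    extractions: list of (subject, relation, object) triples from one system.
    gold_facts: list of sets; each set holds the acceptable triples for one fact.
    Returns (precision, recall, f1).
    """
    # An extraction is correct if it matches an acceptable triple of any fact.
    correct = sum(any(ext in fact for fact in gold_facts) for ext in extractions)
    # A fact is covered if at least one extraction matches it.
    covered = sum(any(ext in fact for ext in extractions) for fact in gold_facts)

    precision = correct / len(extractions) if extractions else 0.0
    recall = covered / len(gold_facts) if gold_facts else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy annotations, made up for illustration.
gold_facts = [
    {("Ada", "wrote", "the first program"), ("Ada", "wrote", "first program")},
    {("Ada", "was born in", "London")},
]
extractions = [
    ("Ada", "wrote", "first program"),
    ("Ada", "lived", "somewhere"),
]

precision, recall, f1 = fact_based_scores(extractions, gold_facts)
print(precision, recall, f1)
```

Under these assumptions, recall is counted over facts rather than over individual triples, so a system is not penalized for producing only one of several equivalent phrasings of the same fact.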

Connected Papers 📈

Ricky Costa

Subscribe to the NLP Cypher newsletter for the latest in NLP & ML code/research. 🤟