Welcome back! We have a long newsletter this week, since many new NLP repos were published as tech nerds returned from their summer vacation. 😁
This week I’ll add close to 150 new NLP repos to the NLP Index, so stay tuned for the update.
Welcome to the Matrix
Six Degrees of Wikipedia
Find the shortest hyperlinked paths between any two pages on Wikipedia.
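For intuition, finding a shortest chain of links between two pages is a breadth-first search over the link graph. The sketch below runs on a tiny hand-made graph; the `LINKS` dictionary and the `shortest_path` helper are made up for illustration and say nothing about how the actual site is implemented.

```python
from collections import deque

# Hypothetical toy link graph: page title -> titles it links to.
LINKS = {
    "Python": ["Guido van Rossum", "Programming language"],
    "Guido van Rossum": ["Netherlands"],
    "Programming language": ["Computer science"],
    "Computer science": ["Alan Turing"],
    "Netherlands": ["Europe"],
    "Alan Turing": [],
    "Europe": [],
}


def shortest_path(links, start, goal):
    """Return one shortest link path from start to goal, or None if unreachable."""
    if start == goal:
        return [start]
    parents = {start: None}  # page -> page we first reached it from
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, []):
            if nxt in parents:
                continue  # already reached by a path that is at least as short
            parents[nxt] = page
            if nxt == goal:
                # Walk the parent pointers back to the start page.
                path = [nxt]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


print(shortest_path(LINKS, "Python", "Alan Turing"))
```

Because breadth-first search explores pages in order of distance from the start, the first time it reaches the goal is guaranteed to be along a shortest path.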
Embeddinghub is a database built for machine learning embeddings. It is built with four goals in mind.
- Store embeddings durably and with high availability
- Allow for approximate nearest neighbor operations
- Enable other operations like partitioning, sub-indices, and averaging
- Manage versioning, access control, and rollbacks painlessly
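To make the nearest neighbor and averaging operations concrete, here is a minimal sketch in plain Python. It does not use the Embeddinghub API; the `EMBEDDINGS` dictionary and the helper functions are hypothetical, and the search is an exact brute-force scan, which is the baseline that approximate nearest neighbor indexes trade a little accuracy against for speed.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(store, query, k=2):
    """Exact (brute-force) top-k keys by cosine similarity to the query."""
    scored = sorted(store.items(), key=lambda kv: cosine(kv[1], query), reverse=True)
    return [key for key, _ in scored[:k]]


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Hypothetical toy embeddings: key -> vector.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
}

print(nearest_neighbors(EMBEDDINGS, [1.0, 0.0, 0.0], k=2))
print(average([EMBEDDINGS["cat"], EMBEDDINGS["dog"]]))
```

A brute-force scan costs time linear in the number of stored vectors per query, which is why dedicated embedding stores lean on approximate indexes once collections grow large.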
GitHub - featureform/embeddinghub: A storage engine for vector machine learning embeddings.
Rubrix | Open Sourced NLP Data Explorer/Annotator
This library is compatible with the usual suspects in NLP: Hugging Face Transformers, spaCy, Stanford Stanza, Flair, etc. Use it to:
- Monitor the predictions of deployed models.
- Collect ground-truth data for starting up a project or evolving an existing one.
- Iterate on ground-truth data and predictions to debug, track and improve your models over time.
- Build custom applications and dashboards on top of your model predictions and ground-truth data.
GitHub - recognai/rubrix: ✨A Python framework to explore, label, and monitor data for NLP projects
Python framework to explore, label, and monitor data for NLP Example: Named Entity Recognition data exploration and…
After 5 years, the survey is back.
Gathering Strength, Gathering Storms: The One Hundred Year Study on Artificial Intelligence (AI100)…
"The One Hundred Year Study on Artificial Intelligence (AI100), launched in the fall of 2014, is a long-term…
Beyond “Vanilla” Question Answering
Deepset blog on how to enhance a QA model by adding more features such as classification, summarization, and generative QA.
Beyond ‘Vanilla’ Question Answering: Start Using Classification, Summarization, and Generative QA
Sentiment classification, summarization and even natural language generation can all be part of your question answering…
Papers to Read 📚
Mistakes Made in AWS
Learning from failure is more informative than learning from success.
Mistakes I've Made in AWS
I've been using AWS "professionally" since about 2015. In that time, I've made lots of mistakes. Other than…
New Models for Sentence Transformers
Nils Reimers on LinkedIn: 🚨Model Alert🚨 🏋️‍♂️ State-of-the-art sentence & paragraph embedding
🚨Model Alert🚨 🏋️‍♂️ State-of-the-art sentence & paragraph embedding models 🍻State-of-the-art semantic search models…
Comparing Language Identification Libraries
A thorough rundown of the leading language identification libraries, comparing their accuracy, language coverage, speed, and memory consumption.
Comparison of language identification models
Detecting the text language (often called language identification) is a common task when building machine learning…
A very handy collection of notebooks for everyday data engineering tasks.
GitHub - jupyter-naas/awesome-notebooks: +100 awesome Jupyter Notebooks templates, organized by…
100 awesome Jupyter Notebooks templates, organized by tools, published by the Naas community, to kickstart your data…
CodeT5 from Salesforce on Hugging Face Model Hub
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
A model capable of general question answering, showing robustness outside the domains it was trained on. It has been trained in a “multi-angle” fashion, which means it can handle a flexible set of input and output “slots” (like question, answer, explanation). Built on top of T5.
GitHub - allenai/macaw: Multi-angle c(q)uestion answering
Macaw (Multi-angle c(q)uestion answering) is a ready-to-use model capable of general question answering, showing…
A technique that augments existing data to train better out-of-scope detectors operating in low-data regimes. GOLD generates pseudo-labeled candidates using samples from an auxiliary dataset and keeps only the most beneficial candidates for training through a novel filtering mechanism.
GitHub - asappresearch/gold: Official repository for "GOLD: Improving Out-of-Scope Detection in…
This repository contains the code and data for GOLD: Improving Out-of-Scope Detection in Dialogues using Data…
A framework based on graph neural networks and temporal commonsense knowledge to model global information and predict the relative order of sentences.
GitHub - declare-lab/sentence-ordering: This repository contains the PyTorch implementation of the…
This repository contains the PyTorch implementation of the paper STaCK: Sentence Ordering with Temporal Commonsense…
The Emory Language and Information Toolkit (ELIT) provides state-of-the-art NLP models for the following tasks:
- Part-of-Speech Tagging
- Named Entity Recognition
- Constituency Parsing
- Dependency Parsing
- Semantic Role Labeling
- AMR Parsing
- Coreference Resolution
- Emotion Detection
GitHub - emorynlp/elit: Emory Language and Information Toolkit
A method for improving the zero-shot learning abilities of language models via instruction tuning.
GitHub - google-research/FLAN
GitHub - yonatanbitton/data_efficient_masked_language_modeling_for_vision_and_language: Repository…
Repository for the paper "Data Efficient Masked Language Modeling for Vision and Language", accepted to Findings of…
Extending the English GQA dataset to 7 typologically diverse languages for cross-lingual visual question answering.
GitHub - Adapter-Hub/xGQA
This repository contains the test-dev data of the paper "xGQA: Cross-lingual Visual Question Answering". xGQA builds…
AliceMind: ALIbaba’s Collection of Encoder-Decoders from MinD Lab
The family of AliceMind:
- Language understanding model: StructBERT
- Generative language model: PALM
- Cross-lingual language model: VECO
- Cross-modal language model: StructVBERT (CVPR 2020 VQA Challenge Runner-up)
- Structural language model: StructuralLM
- Chinese language understanding model with multi-granularity inputs: LatticeBERT
GitHub - alibaba/AliceMind
AliceMind: ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab This repository…
A repo focused on the wav2vec 2.0 model that formalizes several architecture design choices affecting both model performance and efficiency.
GitHub - asappresearch/sew
The repo contains the code of the paper "Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech…
The BioLAMA benchmark comprises 49K biomedical factual knowledge triples for probing biomedical LMs.
GitHub - dmis-lab/BioLAMA: EMNLP'2021: Can Language Models be Biomedical Knowledge Bases?
BioLAMA is biomedical factual knowledge triples for probing biomedical LMs. The triples are collected and pre-processed…
TransferQA is a transferable generative QA model that seamlessly combines extractive QA and multiple-choice QA via a text-to-text transformer framework, and tracks both categorical and non-categorical slots in dialogue state tracking.
GitHub - facebookresearch/Zero-Shot-DST: Zero-shot dialogue state tracking (DST)
This repository includes the implementation of the paper: Leveraging Slot Descriptions for Zero-Shot Cross-Domain…
BenchIE is a benchmark for measuring the performance of Open Information Extraction (OIE) systems. Given manual annotations and a set of extractions from different OIE systems, BenchIE measures precision, recall, and F1 score using its fact-based evaluation approach.
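As a rough illustration of what fact-level scoring can look like, the sketch below assumes each gold fact is a set of acceptable triples, counts an extraction as correct if it appears in any fact, and counts a fact as covered if any extraction hits it. The data layout, the `score` function, and the example triples are all hypothetical; BenchIE's own scorer may differ in its details.

```python
def score(extractions, gold_facts):
    """Return (precision, recall, f1) under the assumptions stated above.

    extractions: list of (subject, relation, object) triples from a system.
    gold_facts: list of sets; each set holds the acceptable triples for one fact.
    """
    # Precision side: how many system extractions match some gold fact.
    matched = sum(1 for ext in extractions if any(ext in fact for fact in gold_facts))
    # Recall side: how many gold facts are hit by at least one extraction.
    covered = sum(1 for fact in gold_facts if any(ext in fact for ext in extractions))

    precision = matched / len(extractions) if extractions else 0.0
    recall = covered / len(gold_facts) if gold_facts else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold annotations: two facts, the first with two accepted phrasings.
GOLD = [
    {("Tesla", "was born in", "Smiljan"), ("Tesla", "born in", "Smiljan")},
    {("Tesla", "was", "an inventor")},
]
# Hypothetical system output: one correct extraction, one spurious one.
SYSTEM = [("Tesla", "born in", "Smiljan"), ("Tesla", "invented", "radio")]

print(score(SYSTEM, GOLD))
```

Grouping acceptable phrasings per fact means a system is not penalized for choosing one valid wording over another, and is not rewarded twice for emitting two wordings of the same fact.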
GitHub - gkiril/benchie: Comprehensive evaluation framework for Open Information Extraction.
Open-source library for Box Embeddings and Box Representations, built on PyTorch & TensorFlow.
GitHub - iesl/box-embeddings: Box Embeddings as Modules
A repo with a model for generating descriptions of fine-art paintings.
GitHub - noagarcia/explain-paintings: Repository for the data in the paper "Explain Me the…
This repository is for the annotated data in the paper Explain Me the Painting: Multi-Topic Knowledgeable Art…