About pylazaro

What is pylazaro?

pylazaro is a Python library that automatically detects unassimilated lexical borrowings (or loanwords) in Spanish text, i.e. words from other languages that are used in Spanish without orthographic adaptation, such as app, lawfare, fake news or machine learning.

pylazaro focuses particularly on borrowings that come from English (a.k.a. anglicisms), although it can also detect some borrowings from other languages (such as Japanese, French or Basque).

How does pylazaro work?

pylazaro takes Spanish text as input and returns the borrowings found in it. Borrowings from English are labeled en; borrowings from other languages are labeled other. At the core of pylazaro is a machine learning model trained on Spanish newspaper text for the task of detecting unassimilated lexical borrowings.
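To make the input/output shape concrete, here is a minimal sketch of how token-level predictions could be collapsed into labeled borrowings. It does not call pylazaro. The function name spans_from_bio, the BIO tagging scheme and the example tags are all assumptions made for illustration (the tags are hand-written, not real model output).

```python
from typing import List, Optional, Tuple


def spans_from_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse token-level BIO tags into (borrowing, label) pairs.

    Hypothetical helper: assumes tags look like "B-en", "I-en", "B-other", "O".
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new borrowing starts: close the previous one, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # The current borrowing continues (multiword borrowing).
            current_tokens.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag: close any open borrowing.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens = []
            current_label = None

    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hand-written toy input, for illustration only.
tokens = ["Las", "fake", "news", "sobre", "el", "sushi", "circulan", "por", "la", "app"]
tags = ["O", "B-en", "I-en", "O", "O", "B-other", "O", "O", "O", "B-en"]

borrowings = spans_from_bio(tokens, tags)
print(borrowings)
```

Each pair holds the borrowing as it appears in the text and its label (en or other), which mirrors the two labels described above.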

pylazaro can be run with five different types of models:

  1. A BiLSTM-CRF model fed with subword embeddings and lexical embeddings pretrained on codeswitching data (this is the best performing model, and the default model used by pylazaro)

  2. A BiLSTM-CRF model fed with subword embeddings and bilingual Transformer-based Spanish-English lexical embeddings

  3. A Transformer model based on multilingual BERT

  4. A Transformer model based on the Spanish model BETO

  5. A Conditional Random Field model with handcrafted features

By default, pylazaro uses the first model (the BiLSTM-CRF with codeswitch embeddings), which is the best-performing of the five, but any of the other models can be selected instead (see How to use pylazaro).

For information about the creation of these models, the training data and the experimental results, see the paper listed under "Where can I check the code, the models or the data behind pylazaro?" below.

What is the point of the pylazaro package?

The models behind pylazaro (the BiLSTM-CRF and the Transformer-based model) have been publicly released and can already be accessed through the HuggingFace model hub. So one could ask what the point of pylazaro is. The purpose of pylazaro is to offer a single interface for all available models for borrowing detection.

Let’s say that we are running the Transformer-based model through the Transformers library, but we want to try the BiLSTM-CRF model, which produces better results. This would require rewriting our Python code and adapting it to the Flair library, which is the library used by the BiLSTM-CRF model. This is a pain if we want to keep switching between models, and it will only get worse if new models based on different third-party packages are released. This scenario was precisely what I encountered while doing my own PhD research. The purpose of pylazaro is therefore to offer a single interface for all borrowing detection models for Spanish, so that switching between models is smooth.
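The idea of a single interface over several backends can be sketched as follows. This is not pylazaro's actual API: the class BorrowingDetector, the BACKENDS registry, the model names and the two toy backends are all hypothetical, and the toy backends use word lists in place of real Flair or Transformers models so that the sketch runs on its own.

```python
from typing import Callable, Dict, List, Tuple

Borrowing = Tuple[str, str]  # (text, label), where label is "en" or "other"
Backend = Callable[[str], List[Borrowing]]


def _toy_bilstm_crf(text: str) -> List[Borrowing]:
    # Stand-in for a Flair-based model; a real backend would call Flair here.
    known = {"app", "lawfare"}
    return [(word, "en") for word in text.split() if word.lower() in known]


def _toy_transformer(text: str) -> List[Borrowing]:
    # Stand-in for a Transformers-based model; it deliberately knows fewer words.
    known = {"app"}
    return [(word, "en") for word in text.split() if word.lower() in known]


# Hypothetical registry: one entry per backend, all sharing the same signature.
BACKENDS: Dict[str, Backend] = {
    "bilstm-crf": _toy_bilstm_crf,
    "transformer": _toy_transformer,
}


class BorrowingDetector:
    """Hypothetical wrapper exposing one method regardless of the backend."""

    def __init__(self, model: str = "bilstm-crf") -> None:
        if model not in BACKENDS:
            raise ValueError(f"Unknown model: {model!r}")
        self._backend = BACKENDS[model]

    def detect(self, text: str) -> List[Borrowing]:
        return self._backend(text)


text = "La app habla de lawfare"
for name in ("bilstm-crf", "transformer"):
    # Switching models only changes one argument; the calling code stays the same.
    print(name, BorrowingDetector(model=name).detect(text))
```

The point of the design is that the code calling detect never touches the underlying library, so adding a model built on yet another third-party package would only require a new entry in the registry.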

In addition, using the Transformers library or Flair may be trivial for experienced programmers, but it may not be that simple for novice Python users. pylazaro also seeks to offer an easy way to use these borrowing detection models for people who work in Linguistics and may not be expert Python users.

I want to detect borrowings in Spanish text. Will pylazaro be suitable for my project?

Maybe. The models behind pylazaro have been trained and tuned for detecting a particular type of borrowing (unassimilated anglicisms) in a very specific setting (Spanish newspaper articles). If your use case is similar to that, pylazaro may be suitable.

But if you are looking to detect, let’s say, orthographically adapted borrowings (such as fútbol or espóiler), or to apply it to a very different type of text (such as tweets or other social media text), it may not work well.

Also, bear in mind that all of these models are far from perfect and can easily make mistakes. The BiLSTM-CRF model (which is the best-performing model so far and the one that pylazaro uses by default) produced an F1 score of 85.76 during evaluation.
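As a reminder of what an F1 score summarizes, here is a small sketch that computes precision, recall and F1 over borrowing spans. It assumes that a prediction only counts as correct when both the span boundaries and the label match exactly; that matching rule, the function name span_f1 and the toy spans are assumptions for illustration and are unrelated to the evaluation figure quoted above.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start token index, end token index, label)


def span_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-label matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations, made up for illustration.
gold = {(1, 3, "en"), (5, 6, "other"), (9, 10, "en"), (12, 13, "en")}
# The model finds three spans: two are right, one has the wrong label, one is missed.
predicted = {(1, 3, "en"), (5, 6, "en"), (9, 10, "en")}

precision, recall, f1 = span_f1(gold, predicted)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

In this toy case two of the three predictions are correct and two of the four gold spans are found, so precision is 2/3, recall is 1/2 and F1 is their harmonic mean, 4/7.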

Where can I check the code, the models or the data behind pylazaro?

  • The code behind pylazaro is available on GitHub.

  • The dataset used to train pylazaro is the COALAS corpus.

  • The two best-performing models behind pylazaro are also available through the HuggingFace model hub.

  • The paper that describes the creation of the dataset and models is available in the ACL Anthology.

Why is it called pylazaro?

The name of this package (and of the whole project that motivates it) is a tribute to the Spanish linguist Lázaro Carreter, whose prescriptivist columns against the use of anglicisms in the Spanish press were extremely popular in Spain during the 1980s and 1990s.

Who develops pylazaro?

pylazaro is built and maintained by Elena Álvarez Mellado, a (computational) linguist based in Spain.

How can I reach the maintainer?

If you have any questions, find any bugs or want to share anything with me, feel free to reach me at ealvarezmellado [@] gmail.com, open an issue on the GitHub repo or ping me on Twitter @lirondos.