fourat-bs/TextNormalizer
Performs contextual, fully supervised text normalization
TextNormalizer
TextNormalizer is a string normalizer that uses SentenceTransformers as a backbone to obtain vector representations of sentences.
It is designed for repeated normalization tasks against a large corpus of strings.
Its main benefit is speed: the embeddings of the normalized strings are computed once and reused, rather than recomputed on every normalization call.
Setup
pip install t-normalizer
Usage
- Create an instance of TextNormalizer. It can be initialized with a SentenceTransformer model or the path to a SentenceTransformer model.
- Obtain the vector representations of the normalized strings with the .fit method.
- Transform each string to its most similar normalized form using the .transform method.
from textnormalizer import TextNormalizer
normalizer = TextNormalizer()
# Canonical forms that inputs should be mapped onto
normalized_text = ['senior software engineer', 'solutions architect', 'junior software developer']
# Raw strings to normalize
to_normalize = ['experienced software engineer', 'software architect', 'entry level software engineer']
# Embed the canonical forms once
normalizer.fit(normalized_text)
# Map each raw string to its most similar canonical form
transformed = normalizer.transform(to_normalize)
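To illustrate the idea behind fit and transform, here is a minimal sketch of embedding the normalized strings once and reusing them for every later lookup. It is not TextNormalizer's actual implementation: the class CachedNormalizerSketch and the function toy_embed are hypothetical names, and toy_embed is a crude character-trigram stand-in for a real SentenceTransformer encoder, chosen only so the example runs without downloading a model. Cosine similarity as the matching rule is likewise an assumption.

```python
import zlib

import numpy as np


def toy_embed(texts, dim=256):
    """Hypothetical stand-in for a sentence encoder.

    Builds hashed character-trigram count vectors and L2-normalizes them.
    A real setup would call a SentenceTransformer model here instead.
    """
    vectors = np.zeros((len(texts), dim))
    for row, text in enumerate(texts):
        padded = f"  {text.lower()}  "
        for i in range(len(padded) - 2):
            trigram = padded[i:i + 3].encode("utf-8")
            vectors[row, zlib.crc32(trigram) % dim] += 1.0
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)


class CachedNormalizerSketch:
    """Illustrative only: caches the embeddings of the normalized strings."""

    def __init__(self, embed=toy_embed):
        self.embed = embed
        self.labels = None
        self.label_vectors = None

    def fit(self, normalized_strings):
        # Embed the normalized strings once and keep the result.
        self.labels = list(normalized_strings)
        self.label_vectors = self.embed(self.labels)
        return self

    def transform(self, strings):
        # Only the incoming strings are embedded here; the cached
        # label vectors are reused as-is.
        queries = self.embed(list(strings))
        similarity = queries @ self.label_vectors.T  # cosine, since rows are unit length
        best = similarity.argmax(axis=1)
        return [self.labels[i] for i in best]


labels = ['senior software engineer', 'solutions architect', 'junior software developer']
sketch = CachedNormalizerSketch().fit(labels)
print(sketch.transform(['Senior Software Engineer']))
```

With this layout, each call to transform costs one embedding pass over the new strings plus a single matrix product, regardless of how many times the same set of normalized strings is queried.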
Serialization
The model, along with the normalized strings and their vector representations, can be saved and loaded with the .save and .load methods.
# save
normalizer.save('path/to/model')
# load
model = TextNormalizer.load('path/to/model')
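As a rough illustration of why persisting the vectors alongside the strings is useful, the sketch below stores a list of normalized strings and their embedding matrix in one file and restores both without re-embedding. It does not reflect the on-disk format used by .save and .load; the helpers save_cache and load_cache are hypothetical, the storage via NumPy's .npz format is an assumption, and the vectors are random placeholders rather than real embeddings.

```python
import os
import tempfile

import numpy as np


def save_cache(path, labels, vectors):
    # Hypothetical helper: store strings and vectors together in one .npz file.
    np.savez(path, labels=np.array(labels, dtype=str), vectors=np.asarray(vectors))


def load_cache(path):
    # Hypothetical helper: restore the strings and vectors, no re-embedding needed.
    with np.load(path) as data:
        return data["labels"].tolist(), data["vectors"]


cached_labels = ['senior software engineer', 'solutions architect', 'junior software developer']
# Placeholder vectors standing in for real sentence embeddings.
cached_vectors = np.random.default_rng(0).normal(size=(3, 4))

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "cache.npz")
    save_cache(cache_path, cached_labels, cached_vectors)
    restored_labels, restored_vectors = load_cache(cache_path)
```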