Machine Learning for Language Detection in Python with scikit-learn



I recently played around with scraping public Telegram channels and finally want to do something with the data. In particular, I wanted to play around with Named Entity Recognition (NER) and Sentiment Analysis on the content of the Telegram posts. I think this will probably work much better when you do it for each language separately. So before we answer the question "what's this post about?" or "how angry is the person in the post?", let's answer the question "what language is the post in?" first. And we'll do it while warming up with that whole machine learning (ML) topic. Be aware: I have no idea what I'm doing.

# Plan

We will use the "European Parliament Proceedings Parallel Corpus 1996-2011" for training (you can download it at https://www.statmt.org/europarl/, its unpacked size is almost 5 GB). Random articles on the internet suggest that a "Multinomial Naive Bayes Classifier" could be a good choice for this use case. In order to turn text into something the classifier can work with, we need to go through a process called "vectorization". Again, random articles on the internet suggest that counting letters could be a good idea. So the rough plan is:

1. Clean up the data
2. Train a model and dump the result to disk
3. Load the model from disk and use it for classification

# Data

The europarl archive contains one directory for each language: bg, cs, da, de, el, en, es, et, fi, fr, hu, it, lt, lv, nl, pl, pt, ro, sk, sl, sv. Each of these directories contains text files with text in the corresponding language. So the directory names will become the labels during training. Inspecting a few of the text files suggests that removing text fragments enclosed in < and > would be a good idea. ep-07-01-15-003.txt in bg, for example, has the following content:
<CHAPTER ID="003">
Състав на Парламента: вж. протоколи
We will be using the regex <[^<]+?> to get rid of those. Generally, the dataset is encoded in UTF-8 (except for pl/ep-09-10-22-009.txt for some reason, but nothing that can't be fixed by errors='ignore').
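Just to illustrate (a quick sketch), applying the regex to the snippet above strips the tag and keeps only the text:

import re

r = re.compile('<[^<]+?>')
sample = '<CHAPTER ID="003">\nСъстав на Парламента: вж. протоколи'
print(r.sub('', sample))  # the tag is gone; only the Bulgarian text (and a newline) remains

The actual loading code then reuses the same pattern and walks over all the language directories: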
import os
import re

# Regex to strip markup fragments like <CHAPTER ID="003">
r = re.compile('<[^<]+?>')
x = []  # texts
y = []  # labels (the language directory names)

# input_directory points at the directory containing the per-language folders (bg, cs, ...)
for label in os.listdir(input_directory):
    for file_name in os.listdir(os.path.join(input_directory, label)):
        y.append(label)
        with open(os.path.join(input_directory, label, file_name), 'rb') as fp:
            x.append(r.sub('', fp.read().decode('utf-8', errors='ignore')))
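A quick sanity check (just a sketch) to make sure nothing was silently skipped:

print(len(x), len(y))   # both lists should have the same length
print(sorted(set(y)))   # should print the 21 language codes from above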
The result of all this is two lists called x and y of the same length: x contains all texts and y contains all corresponding labels. This seems to be very common variable naming in ML land.

# Vectorization & Training

One might jump to writing code for counting letters now, but this seems to be so stereotypically ML that scikit-learn comes with vectorizers for it. In particular we want the sklearn.feature_extraction.text.CountVectorizer (note that by default it tokenizes the text into words and counts those rather than individual letters; counting characters would require something like analyzer='char'). After vectorization, the list x will be called X (uppercase). Another common thing in ML is to randomly split this training set into the actual training data (X_train and y_train) as well as test data (X_test and y_test). This allows for testing the model's performance after training, at the price of reducing the amount of data that can be used for training. To make this random split reproducible, it seems to be common to set the random seed to a fixed value. We'll use 33% of the data for testing:
from joblib import dump
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

cv = CountVectorizer()  # turn the raw texts into count vectors
X = cv.fit_transform(x)
# Hold back 33% of the data for testing, with a fixed random seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = MultinomialNB()
model.fit(X_train, y_train)
logger.info(f'Model-Score: {model.score(X_test, y_test)}')  # logger is set up elsewhere
# Persist both the model and the vectorizer for later use
dump(model, "model.joblib")
dump(cv, "vectorized.joblib")
Training took around 15 minutes on my machine and the score was 0.9797032429455406, which sounds like over 97%, which sounds good, I guess. The model dumped to disk has a size of 1.29 GB and the vectorizer is an 82 MB file. Reading the data from the input directory into memory took almost 20 minutes, which sounds a bit too long, but reading text from files is not the focus of this blog post.

# Usage

Now we can load the model from disk and evaluate it on some real-world data from the Telegram scraper.
from joblib import load

# Load the persisted model and vectorizer
model = load("model.joblib")
cv = load("vectorized.joblib")

# lines: the text lines pulled from the Telegram scraper
for line in lines:
    data = cv.transform([line]).toarray()
    output = model.predict(data)
    print(output, line)
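For lines that could go either way, it can also be interesting to look at the class probabilities instead of just the winning label (a sketch, using the same model and vectorizer):

# Top 3 most likely languages for a single scraped line, with their probabilities
probabilities = model.predict_proba(cv.transform([line]))[0]
top3 = sorted(zip(model.classes_, probabilities), key=lambda t: t[1], reverse=True)[:3]
print(top3, line)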
Loading the model, the vectorizer, and evaluating the model on 1000 lines took around 1 minute on my machine. Here is a small selection of lines where I expected problems:

* ['en'] WE WILL NOT COMPLY!
* ['de'] 💙Wenn sie denken, dass sie irgendeinen Einfluss auf dich haben, du aber spirituell gefestigt bist und nichts dich gleichschalten oder zerstören kann. 💜
* ['de'] Ich bin jetzt grad mal wieder ein paar Stunden im Ahrtal und sehe und erfahre Mißstände und Zustande, die nur noch mit grenzenloser politischer Schande zu betiteln ist.
* ['de'] Sie können mich direkt kontaktieren und bekommen alle Infos.
* ['en'] pinned a video
* ['de'] "Mikrowellenwaffen Wie die guten Jungs unliebsame Bürger aus dem Weg räumen - YouTube"
* ['de'] Nix besonderes, stehe nur aufm parkplatz und mir is langweilig.
* ['en'] So toll! 💜🙏👆👆👆

So all-caps messages are not a problem, nor are emojis. That's good news. The following lines have mixed languages:

* ['de'] Vorsicht vor Gatekeepern. ♦️❗☝️☝️☝️
* ['en'] Schau dir "The Animals "House Of The Rising Sun" on The Ed Sullivan Show" auf YouTube an
* ['pt'] Schau dir "Menino de 3 anos toca bateria de forma incrível" auf YouTube an

I would have classified the first line as German as well, even though the word "Gatekeeper" is English. Similarly, I would have said that the second line is German, but in some sense there's more English in it. Google Translate agrees that the third line is Portuguese, btw. The following is a (probably) complete list of lines I would have classified as German but the model didn't:

* ['et'] ❗️Lohnunternehmen Markus Wipperfürth
* ['et'] Marc Ulrichs „“ _______________________ 19-08-2021_Einschub_01 👉
* ['et'] Pressdruck | Catherine, Frank und Marc #9 on Vimeo
* ['fi'] SSS_#19_Die_Hintergründe_von_C_Was_vor_aller_Augen_weitgehend_unbedacht on Vimeo
* ['it'] Corona Kult
* ['lt'] Bei Altenburg
* ['lt'] Greta 2065 😂
* ['lv'] Kirchsahr - Altenburg
* ['pt'] Da isses. :o)☝️☝️☝️
* ['sv'] ❗️Azubi Wilhelm Hartmann

Overall I'm very happy with the results though, especially if you think about the fact that the model _only_ takes word counts into account. No fancy deep understanding of the language with nouns and verbs and such.

Edit: I pulled around 400,000 lines of text from the database and classified their language; it took about 6 hours (this is probably not saying a lot about the performance of the model though).
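A good chunk of those 6 hours is probably spent transforming and predicting one line at a time; both the vectorizer and the model can handle whole batches at once, which should be noticeably faster (a sketch, not benchmarked):

# Vectorize and classify all lines in one go, keeping the matrix sparse
X_lines = cv.transform(lines)
predictions = model.predict(X_lines)
for prediction, line in zip(predictions, lines):
    print(prediction, line)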
