Language Identification and
Modelling in Specialized Hardware
Introduction
Natural language processing
Tasks
Language identification
Is essential to using
the Web as a corpus
Language Modelling
Challenge
Large data sizes
Solution
Use specialized hardware
Graphics processing units - GPUs
Application
Neural Networks
Parsing
Field-programmable
gate arrays - FPGAs
Features
Fast
Customizable
Application
Encoding Grammars
Research
Idea
Repurpose network
security hardware
Application-specific
integrated circuit (ASIC) for
network monitoring
Deterministic
pushdown
transducer
A finite-state transducer (FST) is a
finite-state automaton (FSA) that produces
output as well as reading input; it is a
kind of finite-state machine.
Recursive finite-domain programs can be
characterized by finite-state transducers
that are augmented with a pushdown store.
Such transducers are called pushdown
transducers.
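As a rough illustration of the definitions above, the following Python sketch emulates a tiny deterministic pushdown transducer. The bracket-nesting task, the function name `bracket_depths`, and the output format are assumptions chosen for illustration; they are not taken from the hardware described here.

```python
def bracket_depths(text):
    """Sketch of a deterministic pushdown transducer (hypothetical example).

    Reads one symbol at a time. For '(' and ')' it emits the nesting depth;
    every other symbol is copied to the output unchanged.
    Returns None when the brackets are unbalanced (input rejected).
    """
    stack = []    # the pushdown store
    output = []   # the transducer's output
    for symbol in text:
        if symbol == "(":
            stack.append(symbol)        # push
            output.append(len(stack))   # depth after the push
        elif symbol == ")":
            if not stack:
                return None             # unmatched ')': reject
            output.append(len(stack))   # depth before the pop
            stack.pop()                 # pop
        else:
            output.append(symbol)       # plain finite-state behaviour
    return output if not stack else None


print(bracket_depths("(a(b))"))
```

Without the stack, a finite-state transducer could not track arbitrarily deep nesting; the pushdown store is what makes that possible.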
Has a stack
Programmable
Executes
regular
expressions
POSIX
When matched
Outputs constant to CPU
Use stack
Push
Pop
Output matched span
Halt
No user-accessible
arithmetic
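The feature list above can be pictured as a small rule interpreter. The sketch below is a CPU-side emulation under stated assumptions: the rule format, the action names (`EMIT`, `PUSH`, `POP`, `SPAN`, `HALT`), and the function `run` are hypothetical, and Python's `re` module uses backtracking semantics rather than POSIX leftmost-longest matching, so it only approximates the behaviour described.

```python
import re


def run(rules, text):
    """Emulate a list of (pattern, action, argument) rules over text.

    Hypothetical actions:
      "EMIT": output the constant argument
      "PUSH": push the argument onto the stack
      "POP":  pop the stack and output the popped value (ignored if empty)
      "SPAN": output the (start, end) offsets of the match
      "HALT": stop scanning and return the output so far
    """
    compiled = [(re.compile(p), action, arg) for p, action, arg in rules]
    stack, output, pos = [], [], 0
    while pos < len(text):
        for regex, action, arg in compiled:
            m = regex.match(text, pos)
            if m is None or m.end() == pos:
                continue                  # no match, or an empty match
            if action == "EMIT":
                output.append(arg)
            elif action == "PUSH":
                stack.append(arg)
            elif action == "POP" and stack:
                output.append(stack.pop())
            elif action == "SPAN":
                output.append((m.start(), m.end()))
            elif action == "HALT":
                return output
            pos = m.end()                 # continue after the match
            break
        else:
            pos += 1                      # nothing matched here: skip a symbol
    return output


rules = [
    (r"[0-9]+", "SPAN", None),
    (r"<", "PUSH", "tag"),
    (r">", "POP", None),
    (r"!", "HALT", None),
    (r"the", "EMIT", 7),
]
print(run(rules, "the <12> x ! the"))
```

Note that the interpreter only compares, pushes, pops, and copies values; it never adds or multiplies, which mirrors the "no user-accessible arithmetic" restriction.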
Applications
Reason
Do not easily
map to regular
expressions
Tasks
Language
Identification
Use model of Lui
and Baldwin, 2012
for 97 languages
Naive Bayes
model
Feature
strings are
converted to
literal regular
expressions
Collect
feature
counts
Emulate
automata
on CPU
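The language identification pipeline above (literal regular expressions, feature counts, Naive Bayes scoring) can be sketched as follows. This is a toy under stated assumptions: the three feature strings, the three languages, and all probabilities are invented for illustration and are not the parameters of the Lui and Baldwin (2012) model.

```python
import math
import re

# Toy model: invented features and probabilities, for illustration only.
FEATURES = ["the", "sch", "ij"]
LOG_PRIOR = {lang: math.log(1 / 3) for lang in ("en", "de", "nl")}
LOG_LIKELIHOOD = {   # log P(feature | language), one entry per feature
    "en": [math.log(0.8), math.log(0.1), math.log(0.1)],
    "de": [math.log(0.2), math.log(0.7), math.log(0.1)],
    "nl": [math.log(0.1), math.log(0.3), math.log(0.6)],
}

# Each feature string becomes a literal regular expression. Wrapping it in a
# lookahead lets overlapping occurrences be counted as well.
PATTERNS = [re.compile("(?=" + re.escape(f) + ")") for f in FEATURES]


def feature_counts(text):
    """Count how often each feature string occurs in text."""
    return [sum(1 for _ in p.finditer(text)) for p in PATTERNS]


def identify(text):
    """Return the language with the highest Naive Bayes log score."""
    counts = feature_counts(text)
    scores = {
        lang: LOG_PRIOR[lang]
        + sum(c * w for c, w in zip(counts, LOG_LIKELIHOOD[lang]))
        for lang in LOG_PRIOR
    }
    return max(scores, key=scores.get)


print(feature_counts("the other thing"), identify("the other thing"))
```

In this split, only `feature_counts` corresponds to the pattern-matching stage; the multiplication and summation in `identify` are arithmetic, so under the constraints listed earlier they would stay on the CPU.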
Language
modelling
Using back-off models
Using telescoping series
A telescoping series is any series where
nearly every term cancels with a
preceding or following term. For
instance, the series
Sends just one value Q per token
instead of a probability and back-off weights
Saves CPU workload
and communications
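To make the telescoping idea concrete, here is a standard textbook telescoping sum, followed by a sketch of how back-off weights could cancel across a sentence so that a single value per token suffices. The notation (p for probabilities, b for back-off weights, q for the single stored value, B_n for the product of back-offs ending at position n, and f for the start of the longest matching n-gram) is an assumption made for this illustration; the paper should be consulted for the exact definition of Q.

```latex
% A standard telescoping sum: all interior terms cancel.
\sum_{n=1}^{N} \left( \frac{1}{n} - \frac{1}{n+1} \right) = 1 - \frac{1}{N+1}

% Back-off model: w_f^n is the longest n-gram ending at w_n that is in the model.
p(w_n \mid w_1^{n-1}) = p(w_n \mid w_f^{n-1}) \prod_{i=1}^{f-1} b(w_i^{n-1})

% Let B_n = \prod_{i=1}^{n} b(w_i^n), taking b = 1 for n-grams not in the model.
% Assumed definition of the single value stored with the longest match:
q(w_f^n) = p(w_n \mid w_f^{n-1}) \,
           \frac{\prod_{i=f}^{n} b(w_i^{n})}{\prod_{i=f}^{n-1} b(w_i^{n-1})}

% Since b(w_i^n) = 1 for i < f, this gives p(w_n \mid w_1^{n-1}) = q(w_f^n) \, B_{n-1} / B_n,
% and the B terms telescope over a sentence of N tokens:
\prod_{n=1}^{N} p(w_n \mid w_1^{n-1}) = \frac{B_0}{B_N} \prod_{n=1}^{N} q(w_f^n)
```

Here B_0 is an empty product and equals 1, and B_N equals 1 under the further assumption that n-grams ending in the end-of-sentence token carry no back-off weight. Under those assumptions the sentence probability is just the product of the q values, which is why one value per token would be enough.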
Simplified query
For each word match as
much context as possible
Use greedy matching
Match as much
leading context as
possible
Scanning until a
match is found
Report the longest match
Resume scanning
The longest matching
N-gram will be reported
Use "greedy"
regular
expressions
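The greedy matching steps above can be sketched on a CPU as follows. This is a stand-in that uses a dictionary lookup instead of greedy regular expressions; the function name `longest_matches`, the `<unk>` fallback entry, and the toy table of log-domain values are assumptions for illustration, not values from any real model.

```python
def longest_matches(tokens, table, max_order=3):
    """For each token, report the longest n-gram ending there that is in table.

    table maps n-gram tuples to a single value per n-gram (here a toy log
    score). Tokens with no matching n-gram fall back to the ("<unk>",) entry,
    which is assumed to exist.
    """
    results = []
    for end in range(1, len(tokens) + 1):
        # Try the longest context first, then shorten it until a match is found.
        for order in range(min(max_order, end), 0, -1):
            ngram = tuple(tokens[end - order:end])
            if ngram in table:
                results.append((ngram, table[ngram]))
                break
        else:
            results.append((("<unk>",), table[("<unk>",)]))
    return results


# Toy table with invented values.
table = {("a",): -1.0, ("b",): -2.0, ("a", "b"): -0.5, ("<unk>",): -9.0}
matches = longest_matches(["a", "b", "c"], table)
print(matches)
print(sum(value for _, value in matches))   # one value per token, summed
```

Because exactly one value is reported per token, the caller only has to add up the reported values, with no separate back-off lookups.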
Experiments
Performance Evaluation
One core
Language
identification
2.4 times faster
than the fastest
CPU program
More details in
the paper
Tested against
some existing
models
CLD2
C++
Original
Python
Java
Language
modelling
1.8 to 6 times faster
than CPU programs
KenLM
DALM
Part of speech
In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical
tagging or word-category disambiguation, is the process of marking up a word in a text (corpus)
as corresponding to a particular part of speech, based on both its definition and its
context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. A
simplified form of this is commonly taught to school-age children, in the identification of words
as nouns, verbs, adjectives, adverbs, etc.