Language Identification and Modelling in Specialized Hardware

Introduction
1. Natural language processing
  1. Tasks
    1. Language identification
      1. Is essential to using the Web as a corpus
    2. Language Modelling
  2. Challenge
    1. Large data sizes
    2. Solution
      1. Use specialized hardware
        Graphics processing units - GPUs
          Applications
            Neural networks
            Parsing
        Field-programmable gate arrays - FPGAs
          Features
            Fast
            Customizable
          Application
            Encoding grammars
Research
1. Idea
  1. Repurpose Network security hardware
    1. Application-specific integrated circuit (ASIC) for network monitoring
      1. Deterministic pushdown transducer
        A finite state transducer (FST) is a finite state automaton (FSA), i.e. a finite state machine, that produces output as well as reading input.
        Recursive finite-domain programs can be characterized by finite-state transducers that are augmented with a pushdown store. Such transducers are called pushdown transducers.
      2. Has a stack
      3. Programmable
        Executes regular expressions
          POSIX
        When matched
          Outputs constant to CPU
          Use stack
            Push
            Pop
          Output matched span
          Halt
      4. No user accessible arithmetic
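The device behaviour listed above can be illustrated with a small software emulation. This is only a sketch under stated assumptions, not the vendor's programming model: the function `run_device`, the op names (`emit`, `push`, `pop`, `span`, `halt`) and the rule format are all hypothetical, and Python's `re` module is not a POSIX engine, so "earliest match, then longest" is approximated by comparing the matches of the individual rules.

```python
import re

def run_device(rules, text):
    """Scan `text`, firing the action of whichever rule matches next.

    `rules` is a list of (regex, op, arg) triples. The op names are
    hypothetical labels for the actions named in the outline.
    """
    compiled = [(re.compile(rx), op, arg) for rx, op, arg in rules]
    stack, outputs, pos = [], [], 0
    while pos < len(text):
        best = None
        for pattern, op, arg in compiled:
            m = pattern.search(text, pos)
            if m is None or m.end() == m.start():
                continue  # ignore misses and empty matches
            key = (m.start(), m.start() - m.end())  # earliest, then longest
            if best is None or key < best[0]:
                best = (key, m, op, arg)
        if best is None:
            break  # nothing left to match
        _, m, op, arg = best
        if op == "emit":      # output a constant to the CPU
            outputs.append(arg)
        elif op == "push":    # push a symbol onto the stack
            stack.append(arg)
        elif op == "pop":     # pop the top symbol, if any
            if stack:
                stack.pop()
        elif op == "span":    # output the matched span (plus stack depth)
            outputs.append((m.group(0), len(stack)))
        elif op == "halt":    # stop scanning
            break
        pos = m.end()
    return outputs

RULES = [
    (r"\(", "push", "paren"),
    (r"\)", "pop", None),
    (r"[0-9]+", "emit", "NUM"),
    (r"[a-z]+", "span", None),
    (r"!", "halt", None),
]
result = run_device(RULES, "ab(cd)7ef!gh")
print(result)
```

Note that the emulation does no arithmetic on the device side: anything beyond matching, stack operations and output is left to the caller, mirroring the "no user accessible arithmetic" restriction.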
    2. Applications
      1. Reason
        Do not easily map to regular expressions
      2. Tasks
        Language Identification
        Use the model of Lui and Baldwin (2012) for 97 languages
        Naive Bayes model
        Feature strings are converted to literal regular expressions
        Collect feature counts
        Emulate automata on CPU
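The split of work described above (pattern matching collects feature counts, the CPU does the Naive Bayes arithmetic) can be sketched as follows. The feature strings, weights and priors are made-up toy values for two languages, not the parameters of the Lui and Baldwin model; the names `FEATURES`, `LOG_WEIGHTS`, `LOG_PRIOR`, `count_features` and `identify` are hypothetical.

```python
import math
import re

# Toy parameters (assumptions for illustration only).
FEATURES = ["the", "sch", "ie"]
LOG_WEIGHTS = {                     # log P(feature | language)
    "en": [-1.0, -6.0, -3.0],
    "de": [-5.0, -1.5, -1.2],
}
LOG_PRIOR = {"en": math.log(0.5), "de": math.log(0.5)}

def count_features(text):
    """Count (possibly overlapping) occurrences of each literal feature.

    Each feature string is escaped so that it acts as a literal regular
    expression; the lookahead makes overlapping occurrences count too.
    """
    return [len(re.findall("(?=" + re.escape(f) + ")", text)) for f in FEATURES]

def identify(text):
    """Multinomial Naive Bayes: log prior plus count-weighted log likelihoods."""
    counts = count_features(text)
    scores = {
        lang: LOG_PRIOR[lang] + sum(c * w for c, w in zip(counts, weights))
        for lang, weights in LOG_WEIGHTS.items()
    }
    return max(scores, key=scores.get)

print(identify("the theory of the thing"))
print(identify("die schiene ist schief"))
```

In this arrangement only `count_features` would run on the matching hardware; the weighted sum in `identify` is the part that has to stay on the CPU because the device exposes no arithmetic.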
        Language modelling
        Using back-off models
        Using a telescoping series
        A telescoping series is any series where nearly every term cancels with a preceding or following term.
        Annotations:
        https://en.wikipedia.org/wiki/Telescoping_series
        http://math.oregonstate.edu/home/programs/undergrad/CalculusQuestStudyGuides/SandS/SeriesTests/telescoping.html
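A standard textbook example (not taken from these notes) shows the cancellation: every inner term appears once with each sign, so only the first and last terms survive.

```latex
\sum_{n=1}^{N}\left(\frac{1}{n}-\frac{1}{n+1}\right)
  = \left(1-\frac{1}{2}\right)+\left(\frac{1}{2}-\frac{1}{3}\right)+\cdots+\left(\frac{1}{N}-\frac{1}{N+1}\right)
  = 1-\frac{1}{N+1}
```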
        Collapse probability and back-off into a single function
        Preserves sentence-level probabilities
        Sends just one value Q per token instead of a probability and back-off weights
        Reduces CPU workload and communication
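The following sketch illustrates how probability and back-off can be folded into one value per token while the sentence total stays the same. It is an assumption about the form of Q, not a quotation from the paper: each token is charged the back-offs of all suffixes of its matched n-gram in advance and is refunded the back-offs of the contexts it actually used, so the sum telescopes. The toy model, the log10 values and the names `MODEL`, `standard_score` and `q_score` are all hypothetical.

```python
import math

ORDER = 3
# Toy back-off model in log10: n-gram -> (log_prob, log_backoff).
MODEL = {
    ("<s>",): (0.0, -0.3),
    ("a",): (-0.5, -0.2),
    ("b",): (-0.7, -0.4),
    ("</s>",): (-0.6, 0.0),
    ("<s>", "a"): (-0.3, -0.1),
    ("a", "b"): (-0.4, -0.25),
    ("b", "</s>"): (-0.2, 0.0),
    ("<s>", "a", "b"): (-0.1, 0.0),
}

def backoff(ngram):
    """Log back-off weight; n-grams missing from the model contribute 0."""
    return MODEL.get(ngram, (0.0, 0.0))[1]

def longest_match(history, word):
    """Longest n-gram ending in `word` that the model contains."""
    context = tuple(history[-(ORDER - 1):])
    for start in range(len(context) + 1):
        ngram = context[start:] + (word,)
        if ngram in MODEL:
            return ngram
    raise KeyError(word)

def standard_score(history, word):
    """Usual query: probability plus back-offs of the skipped contexts."""
    context = tuple(history[-(ORDER - 1):])
    ngram = longest_match(history, word)
    skipped = len(context) + 1 - len(ngram)
    return MODEL[ngram][0] + sum(backoff(context[i:]) for i in range(skipped))

def q_score(history, word):
    """Single value per token: charge back-offs ahead, refund those used."""
    ngram = longest_match(history, word)
    charged = sum(backoff(ngram[i:]) for i in range(len(ngram)))
    refunded = sum(backoff(ngram[i:-1]) for i in range(len(ngram) - 1))
    return MODEL[ngram][0] + charged - refunded

def sentence_scores(words):
    history = ["<s>"]
    standard_total = 0.0
    q_total = backoff(("<s>",))  # assumption: <s> is charged up front
    for word in words:
        standard_total += standard_score(history, word)
        q_total += q_score(history, word)
        history.append(word)
    return standard_total, q_total

standard_total, q_total = sentence_scores(["a", "b", "b", "</s>"])
print(standard_total, q_total)
```

Individual Q values differ from the per-token probabilities (one of them is even positive in log space here), but the sentence totals agree, which is the property the outline relies on.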
        Simplified query
        For each word match as much context as possible
        Use greedy matching
        Match as much leading context as possible
        Scan until a match is found
        Report the longest match
        Resume scanning
        The longest matching N-gram will be reported
        Use "greedy" regular expressions
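The greedy query above can be emulated in software as a rough sketch. It assumes n-grams are stored as tuples of tokens in a set and reports, for each token, the longest known n-gram ending at it; the function name `greedy_longest_matches` and the toy n-gram set are hypothetical, and a hardware version would express the same preference through greedy regular expressions rather than explicit loops.

```python
def greedy_longest_matches(tokens, ngrams, max_order):
    """For each token, report the longest known n-gram ending at that token.

    Tokens for which not even the unigram is known produce no report.
    """
    matches = []
    for end in range(1, len(tokens) + 1):
        # Try the longest available context first, then shorten it.
        for order in range(min(max_order, end), 0, -1):
            candidate = tuple(tokens[end - order:end])
            if candidate in ngrams:
                matches.append(candidate)
                break  # longest match reported; resume scanning
    return matches

NGRAMS = {
    ("the",), ("cat",), ("sat",), ("on",), ("mat",),
    ("the", "cat"), ("the", "mat"),
    ("the", "cat", "sat"),
}
matches = greedy_longest_matches("the cat sat on the mat".split(), NGRAMS, 3)
print(matches)
```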
2. Experiments
  1. Performance Evaluation
    1. One core
      1. Language identification
        2.4 times faster
        Than the fastest CPU program
        More details in the paper
        Tested against some existing models
        CLD2
        C++
        Original
        Python
        Java
      2. Language modelling
        1.8 to 6 times faster
        Than CPU programs
        KenLM
        DALM
        Part of speech
        In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
        Annotations:
        https://en.wikipedia.org/wiki/Part-of-speech_tagging
  2. Tarari T2540 PCI express
    1. Controlled by a single-threaded CPU program
      1. Performs the arithmetic
      2. Scales up to 4 devices


	Created by Ivan Zapreev over 9 years ago