lv.gotika.engine
Class GothicAnalyzer

java.lang.Object
  extended by lv.gotika.engine.GothicAnalyzer

public class GothicAnalyzer
extends Object

Analyzer for Latvian historical texts (originally written in Gothic orthography). Supports the following steps of analysis:

  1. Transliteration.
  2. Morphological analysis and lemmatization (with or without guessing).
  3. Verification of acquired lemmas against dictionaries.
  4. Fuzzy lemma-dictionary alignment.


Nested Class Summary
static class GothicAnalyzer.DictView
          A set of flags that indicate different views of the in-memory dictionary.
static class GothicAnalyzer.ResultTag
          A set of tags that indicate different steps of analysis process.
static class GothicAnalyzer.ResultView
          A set of flags that indicate different views (output streams) of the analysis results.
 
Field Summary
 Boolean DUPLICATES
          Do include references to all the sources per word while loading the in-memory dictionary?
 Boolean SYNONYMS
          Do extend analysis results with synonyms?
 Boolean SOUNDEX
          Do apply fuzzy search?
 
Constructor Summary
GothicAnalyzer(String conf, boolean duplicates)
          Loads all the dictionaries and the morphological analyzer.
GothicAnalyzer(String conf, boolean duplicates, boolean soundex, boolean syn)
          Loads all the dictionaries and the morphological analyzer.
 
Method Summary
 Properties analyzeText(Reader in)
          Processes a whole text: analyzes it word by word (context is not taken into account) and prints the results to one or more output streams (in different formats), depending on the configuration of the analyzer.
 Pair<Boolean,ArrayList<String>> analyzeWord(String word)
          Analyzes an individual word form (taken directly from a text).
 TreeMap<String,TreeSet<String>> extractSynonyms(Pair<Boolean,ArrayList<String>> results)
          Extracts synonyms from the result set returned by analyzeWord(String) and associates them with the corresponding lemmas.
 ArrayList<Pair<String,String>> getBySoundex(String p)
          Searches for words in the in-memory dictionary that match the given soundex pattern.
 ArrayList<Pair<String,String>> getSynonyms(String w)
          Searches for synonyms for the given word.
 boolean isOn(String view)
          Checks whether an output stream according to the specified result view is turned on.
 boolean isOnAny()
          Checks whether at least one output stream is turned on.
 ArrayList<String> lemmatize(String word, boolean guess)
          Finds potential lemmas for the given word form using the SemTi-Kamols morphological analyzer (http://www.semti-kamols.lv/).
 void printDictionary(String view, String file)
          Prints the in-memory dictionary into a file.
 ArrayList<Pair<String,String>> searchDirectly(String w)
          Searches for the given word in the in-memory dictionary as is.
 ArrayList<Pair<String,String>> searchFuzzy(String w)
          Searches for the given word in the in-memory dictionary in a fuzzy manner (a soundex pattern is created first).
 String soundex(String word)
          Creates a soundex pattern for the given word.
 String transliterate(String word)
          Transliterates the given word form from the Gothic to the contemporary orthography, as far as it can be done unambiguously.
 boolean turnOff(String view)
          Stops writing to an output stream according to the specified result view.
 void turnOffAll()
          Stops writing to all output streams.
 boolean turnOn(String view, Writer out)
          Sets an output stream according to the specified result view (format).
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DUPLICATES

public final Boolean DUPLICATES
Do include references to all the sources per word while loading the in-memory dictionary? If false, only the first reference is kept.
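
The effect of this flag can be illustrated with a minimal sketch. It is not the analyzer's actual loading code: the class DuplicatesSketch, the method load and the source names dictA and dictB are hypothetical, and the dictionary is reduced to a map from a word to its list of source references.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the in-memory dictionary; not part of lv.gotika.engine.
final class DuplicatesSketch {

    // Builds a word -> source references map from {word, source} pairs.
    // With duplicates == true every reference is stored; otherwise only the first one.
    static Map<String, List<String>> load(String[][] entries, boolean duplicates) {
        Map<String, List<String>> dict = new LinkedHashMap<>();
        for (String[] entry : entries) {
            List<String> refs = dict.computeIfAbsent(entry[0], k -> new ArrayList<>());
            if (duplicates || refs.isEmpty()) {
                refs.add(entry[1]);
            }
        }
        return dict;
    }

    public static void main(String[] args) {
        // Illustrative entries; "dictA" and "dictB" are made-up source names.
        String[][] entries = { {"maize", "dictA"}, {"maize", "dictB"}, {"galds", "dictA"} };
        System.out.println(load(entries, true));   // {maize=[dictA, dictB], galds=[dictA]}
        System.out.println(load(entries, false));  // {maize=[dictA], galds=[dictA]}
    }
}
```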


SOUNDEX

public Boolean SOUNDEX
Do apply fuzzy search?


SYNONYMS

public Boolean SYNONYMS
Do extend analysis results with synonyms?

Constructor Detail

GothicAnalyzer

public GothicAnalyzer(String conf,
                      boolean duplicates)
               throws Exception
Loads all the dictionaries and the morphological analyzer. Default parameters:

Parameters:
conf - configuration file containing references to the dictionaries.
duplicates - if true references to all sources per word will be stored; otherwise only the first one.
Throws:
Exception - error while reading some dictionary or unsuccessful initialization of the morphological analyzer.

GothicAnalyzer

public GothicAnalyzer(String conf,
                      boolean duplicates,
                      boolean soundex,
                      boolean syn)
               throws Exception
Loads all the dictionaries and the morphological analyzer.

Parameters:
conf - configuration file containing references to the dictionaries.
duplicates - if true references to all sources per word will be stored; otherwise only the first one.
soundex - do apply fuzzy search?
syn - do extend results with synonyms?
Throws:
Exception - error while reading some dictionary or unsuccessful initialization of the morphological analyzer.
Method Detail

turnOn

public boolean turnOn(String view,
                      Writer out)
               throws IllegalArgumentException
Sets an output stream according to the specified result view (format).

Parameters:
view - a flag that indicates the view.
out - output stream.
Returns:
true if an output was already set for this view; the output stream is replaced in either case.
Throws:
IllegalArgumentException - invalid flag or output stream is null.
See Also:
GothicAnalyzer.ResultView

turnOff

public boolean turnOff(String view)
                throws IllegalArgumentException
Stops writing to an output stream according to the specified result view. A previously opened stream is not closed.

Parameters:
view - a flag that indicates the view.
Returns:
false, if such an output has not been turned on.
Throws:
IllegalArgumentException - invalid flag.
See Also:
GothicAnalyzer.ResultView

turnOffAll

public void turnOffAll()
Stops writing to all output streams. Previously opened streams are not closed.


isOn

public boolean isOn(String view)
             throws IllegalArgumentException
Checks whether an output stream according to the specified result view is turned on.

Parameters:
view - a flag that indicates the view.
Returns:
true, if on; false otherwise.
Throws:
IllegalArgumentException - invalid flag.
See Also:
GothicAnalyzer.ResultView

isOnAny

public boolean isOnAny()
Checks whether at least one output stream is turned on.

Returns:
true, if there is some output stream set.
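
The return-value contracts of turnOn, turnOff, isOn and isOnAny can be modelled with a small stand-in. This is a sketch of the documented behaviour only, not the class's implementation: ViewSwitchSketch is hypothetical, and the flags VIEW_A and VIEW_B are placeholders for the real values defined in GothicAnalyzer.ResultView.

```java
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the view switches; not part of lv.gotika.engine.
final class ViewSwitchSketch {
    // Placeholder flags; the real ones come from GothicAnalyzer.ResultView.
    private static final Set<String> VALID = new HashSet<>(Arrays.asList("VIEW_A", "VIEW_B"));
    private final Map<String, Writer> outputs = new HashMap<>();

    boolean turnOn(String view, Writer out) {
        check(view);
        if (out == null) throw new IllegalArgumentException("output stream is null");
        // put() returns the previous writer, so non-null means "was already set".
        return outputs.put(view, out) != null;
    }

    boolean turnOff(String view) {
        check(view);
        // The removed writer is deliberately left open; closing it is the caller's job.
        return outputs.remove(view) != null;
    }

    void turnOffAll() { outputs.clear(); }

    boolean isOn(String view) { check(view); return outputs.containsKey(view); }

    boolean isOnAny() { return !outputs.isEmpty(); }

    private static void check(String view) {
        if (!VALID.contains(view)) throw new IllegalArgumentException("invalid flag: " + view);
    }

    public static void main(String[] args) {
        ViewSwitchSketch s = new ViewSwitchSketch();
        Writer w = new StringWriter();
        System.out.println(s.turnOn("VIEW_A", w));  // false: nothing was set before
        System.out.println(s.turnOn("VIEW_A", w));  // true: already set, replaced anyway
        System.out.println(s.turnOff("VIEW_B"));    // false: was never turned on
        System.out.println(s.isOnAny());            // true: VIEW_A is still on
    }
}
```

Because neither turnOff nor turnOffAll closes the writers, the caller remains responsible for flushing and closing them.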

analyzeWord

public Pair<Boolean,ArrayList<String>> analyzeWord(String word)
                                            throws Exception
Analyzes an individual word form (taken directly from a text).

Parameters:
word - a word form.
Returns:
a pair that represents result set of the analysis process:
  1. the Boolean value indicates whether any approved lemma has been found for the given word form;
  2. the list contains all the suggestions for indexing; each result is formatted as follows: a sequence of all the "metamorphoses" of the given word form; items are separated with ResultTag.DELIMITER and a tag indicating the step of analysis is assigned to each of them.
Throws:
Exception - unsuccessful initialization of the morphological analyzer.
See Also:
GothicAnalyzer.ResultTag

analyzeText

public Properties analyzeText(Reader in)
                       throws Exception
Processes a whole text: analyzes it word by word (context is not taken into account) and prints the results to one or more output streams (in different formats), depending on the configuration of the analyzer. Notes: One or more streams can be set simultaneously. If only a transliteration stream is turned on, further steps of analysis are not performed.

Parameters:
in - a text stream.
Returns:
statistics about the recognized, non-recognized and total number of words, or null if only transliteration was performed. Keys: recognized, unknown, total.
Throws:
Exception - if the morphological analyzer could not be initialized, some of the I/O streams could not be accessed, or no output stream is set.
See Also:
analyzeWord(String), turnOn(String, Writer)
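
A caller might consume the returned statistics roughly as follows. This is a sketch under assumptions: the helper StatsSketch.coverage is hypothetical, and the values under the documented keys are assumed to be decimal strings, which this page does not state. The null check covers the documented transliteration-only case.

```java
import java.util.Properties;

// Hypothetical helper for reading the statistics returned by analyzeText(Reader).
final class StatsSketch {

    // Returns recognized/total, or -1 if no statistics are available.
    // Assumes the values are stored as decimal strings.
    static double coverage(Properties stats) {
        if (stats == null) {
            return -1.0;  // only transliteration was performed
        }
        int recognized = Integer.parseInt(stats.getProperty("recognized", "0").trim());
        int total = Integer.parseInt(stats.getProperty("total", "0").trim());
        return total == 0 ? 0.0 : (double) recognized / total;
    }

    public static void main(String[] args) {
        // Stand-in for the object returned by analyzeText; the numbers are made up.
        Properties stats = new Properties();
        stats.setProperty("recognized", "75");
        stats.setProperty("unknown", "25");
        stats.setProperty("total", "100");
        System.out.println(coverage(stats));  // 0.75
    }
}
```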

lemmatize

public ArrayList<String> lemmatize(String word,
                                   boolean guess)
                            throws Exception
Finds potential lemmas for the given word form using the SemTi-Kamols morphological analyzer (http://www.semti-kamols.lv/).

Parameters:
word - a word form.
guess - if true and the word is not defined in the morphological lexicon, lemmas are guessed (suggestions should be verified in a dictionary).
Returns:
a list of lemmas.
Throws:
Exception

soundex

public String soundex(String word)
Creates a soundex pattern for the given word.

Parameters:
word - a word.
Returns:
a pattern.
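
This page does not specify which soundex variant the method implements, and a variant tuned for Latvian most likely differs from the classic one. Purely as an illustration of what a soundex pattern is, the sketch below implements the classic American Soundex for ASCII letters; the class SoundexSketch is hypothetical and is not the algorithm used by GothicAnalyzer.

```java
import java.util.Locale;

// Classic American Soundex, shown only to illustrate the idea of a phonetic pattern.
final class SoundexSketch {
    // Digit for each letter A..Z; '0' marks vowels and other uncoded letters.
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String word) {
        String s = word.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (int i = 0; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (c < 'A' || c > 'Z') continue;      // non-ASCII letters are skipped here
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {               // keep the first letter as is
                out.append(c);
                prev = code;
                continue;
            }
            if (c == 'H' || c == 'W') continue;    // H and W do not separate equal codes
            if (code != '0' && code != prev) out.append(code);
            prev = code;                           // a vowel resets prev to '0'
        }
        while (out.length() < 4) out.append('0');  // pad to letter + three digits
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Robert"));    // R163
        System.out.println(soundex("Rupert"));    // R163
        System.out.println(soundex("Ashcraft"));  // A261
    }
}
```

Words that sound alike map to the same pattern, which is what makes a pattern usable as a lookup key for fuzzy search.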

transliterate

public String transliterate(String word)
Transliterates the given word form from the Gothic to the contemporary orthography, as far as it can be done unambiguously. The first letter may be either in lower or in upper case.

Parameters:
word - a word form.
Returns:
a transliteration equivalent.
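
The rule table used by this method is not documented here. The sketch below only illustrates one plausible shape of such a transliterator: a longest-match-first rewrite that restores the case of the first letter. The class TransliterationSketch and its three rules (ee to ie, ah to a with macron, w to v) are illustrative assumptions, not the analyzer's actual table, and unlike the real method the sketch lower-cases the whole word.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical rule-based transliterator; the rules are illustrative only.
final class TransliterationSketch {
    // Insertion order matters: digraphs are listed before single letters.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ee", "ie");
        RULES.put("ah", "\u0101");  // a with macron
        RULES.put("w", "v");
    }

    static String transliterate(String word) {
        if (word.isEmpty()) return word;
        boolean capitalized = Character.isUpperCase(word.charAt(0));
        String lower = word.toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < lower.length()) {
            boolean matched = false;
            for (Map.Entry<String, String> rule : RULES.entrySet()) {
                if (lower.startsWith(rule.getKey(), i)) {
                    out.append(rule.getValue());
                    i += rule.getKey().length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(lower.charAt(i));  // no rule applies: copy the character
                i++;
            }
        }
        if (capitalized) {
            out.setCharAt(0, Character.toUpperCase(out.charAt(0)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("Weesis"));  // Viesis
    }
}
```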

extractSynonyms

public TreeMap<String,TreeSet<String>> extractSynonyms(Pair<Boolean,ArrayList<String>> results)
Extracts synonyms from the result set returned by analyzeWord(String) and associates them with the corresponding lemmas.

Parameters:
results - a list of analysis results, formatted according to analyzeWord(String).
Returns:
sets of synonyms linked by head words.
See Also:
analyzeWord(String)
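
The returned structure maps each head word to a sorted set of its synonyms. The sketch below shows how such a map can be built and traversed with the standard collections only; the class SynonymMapSketch, its methods and the sample words are hypothetical, and no result-string parsing is attempted because the tag format is defined elsewhere (GothicAnalyzer.ResultTag).

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper showing the shape of the map returned by extractSynonyms.
final class SynonymMapSketch {

    // Adds a synonym under its head word, creating the set on first use.
    static void link(TreeMap<String, TreeSet<String>> map, String head, String synonym) {
        map.computeIfAbsent(head, k -> new TreeSet<>()).add(synonym);
    }

    // Renders one "head: syn1, syn2" line per head word, in sorted order.
    static String render(TreeMap<String, TreeSet<String>> map) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, TreeSet<String>> e : map.entrySet()) {
            sb.append(e.getKey()).append(": ")
              .append(String.join(", ", e.getValue())).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TreeMap<String, TreeSet<String>> map = new TreeMap<>();
        link(map, "house", "home");       // placeholder words, not dictionary data
        link(map, "house", "dwelling");
        link(map, "house", "home");       // duplicates collapse inside the TreeSet
        System.out.print(render(map));    // house: dwelling, home
    }
}
```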

printDictionary

public void printDictionary(String view,
                            String file)
                     throws IOException
Prints the in-memory dictionary into a file. No sorting is performed.

Parameters:
view - a flag that indicates the view of interest.
file - destination filename.
Throws:
IOException - error while printing to the file.
IllegalArgumentException - invalid flag.
See Also:
GothicAnalyzer.DictView

searchDirectly

public ArrayList<Pair<String,String>> searchDirectly(String w)
Searches for the given word in the in-memory dictionary as is.

Parameters:
w - a word.
Returns:
a list of appropriate entries (typically one).

searchFuzzy

public ArrayList<Pair<String,String>> searchFuzzy(String w)
Searches for the given word in the in-memory dictionary in a fuzzy manner (a soundex pattern is created first).

Parameters:
w - a word.
Returns:
a list of appropriate entries.

getBySoundex

public ArrayList<Pair<String,String>> getBySoundex(String p)
Searches for words in the in-memory dictionary that match the given soundex pattern.

Parameters:
p - a pattern.
Returns:
a list of corresponding words (along with references).
See Also:
soundex(String)
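
Taken together, the descriptions of soundex(String), getBySoundex(String) and searchFuzzy(String) suggest that a fuzzy search derives a pattern from the word and then looks the pattern up. The sketch below models that relationship only; it is an assumption about the design, not the actual implementation. FuzzyIndexSketch is hypothetical, the key function passed to it stands in for soundex(String), and source references are omitted, so plain words are returned instead of Pair<String,String>.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical pattern-keyed index; not part of lv.gotika.engine.
final class FuzzyIndexSketch {
    private final Function<String, String> keyOf;  // stand-in for soundex(String)
    private final Map<String, List<String>> byPattern = new HashMap<>();

    FuzzyIndexSketch(Function<String, String> keyOf) {
        this.keyOf = keyOf;
    }

    // Indexes a dictionary word under its pattern.
    void add(String word) {
        byPattern.computeIfAbsent(keyOf.apply(word), k -> new ArrayList<>()).add(word);
    }

    // Counterpart of getBySoundex(String): look up by a ready-made pattern.
    List<String> getByPattern(String pattern) {
        return byPattern.getOrDefault(pattern, Collections.emptyList());
    }

    // Counterpart of searchFuzzy(String): derive the pattern first, then look it up.
    List<String> searchFuzzy(String word) {
        return getByPattern(keyOf.apply(word));
    }

    public static void main(String[] args) {
        // Toy key function: lower-case and drop ASCII vowels. Not a real soundex.
        FuzzyIndexSketch index = new FuzzyIndexSketch(
                w -> w.toLowerCase(Locale.ROOT).replaceAll("[aeiou]", ""));
        index.add("galds");
        index.add("maize");
        System.out.println(index.searchFuzzy("gulds"));  // [galds]
        System.out.println(index.searchFuzzy("zirgs"));  // []
    }
}
```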

getSynonyms

public ArrayList<Pair<String,String>> getSynonyms(String w)
Searches for synonyms for the given word.

Parameters:
w - a word.
Returns:
a list of synonyms (along with references).