Analyzer Metamodule

Although FreeLing is a toolbox with a variety of modules to pick and choose from for specific uses, most applications are likely to need a text analysis pipeline that goes from raw text to a certain level of annotation.

To ease the construction of applications that call FreeLing, the analyzer metamodule has been included. This metamodule implements a pipeline that calls a standard sequence of FreeLing modules: tokenizer, splitter, morphological analyzer, PoS tagger, sense annotation, WSD, NEC, parsing, SRL, and coreferences.

A set of customizable options in this module allows the calling application to control the start and end levels of the pipeline (e.g. from text to shallow parsing, or from tagged input to coreferences), as well as which modules are turned on or off and which configuration files they are loaded with.

Thus, different instances of this class can be created using different option sets to get analyzers for different languages or different tasks.
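The idea of running only the part of the pipeline between a start level and an end level can be sketched with a toy model. The code below is not FreeLing code: the `Level` names, the `Stage` struct and the `stages_to_run` function are assumptions made for illustration, and the real level enumeration may differ in names and granularity.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical level names for illustration only; the real enumeration
// used by the analyzer may use different names and more levels.
enum Level { TEXT, TOKEN, SPLITTED, MORFO, TAGGED, SHALLOW, DEP, COREF };

struct Stage {
  std::string name;  // label of the stand-in module
  Level produces;    // annotation level available after the stage runs
};

// Return the names of the stages a pipeline would run to take input that is
// already annotated up to level `in` and bring it to level `out`.
std::vector<std::string> stages_to_run(Level in, Level out) {
  static const std::vector<Stage> pipeline = {
      {"tokenizer", TOKEN},       {"splitter", SPLITTED},
      {"morfo", MORFO},           {"tagger", TAGGED},
      {"chart parser", SHALLOW},  {"dependency parser", DEP},
      {"coreference", COREF}};
  std::vector<std::string> selected;
  for (const Stage &s : pipeline) {
    // A stage runs only if its output is beyond what the input already
    // has and not beyond the requested output level.
    if (s.produces > in && s.produces <= out) selected.push_back(s.name);
  }
  return selected;
}
```

In this sketch, `stages_to_run(TEXT, SHALLOW)` selects the five stages from the tokenizer to the chart parser, while `stages_to_run(TAGGED, COREF)` skips everything up to and including the tagger.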

The API of the class is the following:

class analyzer {
 public:
   typedef analyzer_config_options config_options;
   typedef analyzer_invoke_options invoke_options;

   /// constructor, given a set of creation options
   analyzer(const config_options &cfg);

   /// Destructor
   ~analyzer();

   /// get current execution options
   const invoke_options& get_current_invoke_options() const;

   /// change execution options for next call
   void set_current_invoke_options(const invoke_options &opt, bool check=true);

   /// analyze text as a whole document.
   /// 'parag' indicates whether a blank line is to be considered a paragraph 
   /// separator.
   void analyze(const wstring &text, document &doc, bool parag=false) const;

   /// Analyze text as a partial document. Retain incomplete sentences in buffer   
   /// in case next call completes them (except if flush==true)
   void analyze(const wstring &text, std::list<sentence> &ls, bool flush=false);

   /// analyze further levels on a partially analyzed document
   void analyze(document &doc) const;

   /// analyze further levels on partially analyzed sentences
   void analyze(std::list<sentence> &ls) const;

   /// flush splitter buffer and analyze any pending text. 
   void flush_buffer(std::list<sentence> &ls);

   /// Reset tokenizer byte offset counter to 0.
   void reset_offset();
};

The constructor expects a set of configuration options (see class analyzer::config_options below) which specify creation-time options for all modules that need to be loaded. These options are basically configuration and data files to load.

The analyzer metamodule will create all modules for which a configuration file is specified. If a module does not need to be created, the corresponding option in analyzer::config_options should be left empty.

Once the analyzer instance is created, a set of invocation options must be specified (see the description of class analyzer::invoke_options below). Invocation options are run-time options and can be altered for each analysis if necessary (e.g. if one needs to apply different processes to different kinds of input texts). They include activating/deactivating modules and changing the initial/final points in the pipeline.

Once invoke options are set, the analyzer metamodule can be called to process plain text, or to enrich a partially analyzed document.
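The partial-document variant of analyze listed in the API retains incomplete sentences in a buffer until a later call completes them or the buffer is flushed. The following is a minimal sketch of that buffering behaviour only, under strong simplifying assumptions: `toy_buffered_splitter` and `demo_partial_input` are hypothetical names, a "sentence" is just a chunk of text ending in a period, and no FreeLing types are used.

```cpp
#include <cassert>
#include <list>
#include <string>

// Toy stand-in for the buffering behaviour of the partial-document analyze().
// It is NOT FreeLing code: "sentences" here are simply chunks ending in '.'.
class toy_buffered_splitter {
 public:
  // Append text and move every complete sentence into `out`. An unfinished
  // tail stays in the buffer unless `flush` is true.
  void analyze(const std::wstring &text, std::list<std::wstring> &out,
               bool flush = false) {
    buffer_ += text;
    std::wstring::size_type dot;
    while ((dot = buffer_.find(L'.')) != std::wstring::npos) {
      out.push_back(buffer_.substr(0, dot + 1));
      buffer_.erase(0, dot + 1);
    }
    if (flush) flush_buffer(out);
  }

  // Emit whatever is pending as a final sentence and empty the buffer.
  void flush_buffer(std::list<std::wstring> &out) {
    if (!buffer_.empty()) out.push_back(buffer_);
    buffer_.clear();
  }

 private:
  std::wstring buffer_;
};

// Feed two chunks that split a sentence in the middle, then flush.
std::list<std::wstring> demo_partial_input() {
  toy_buffered_splitter sp;
  std::list<std::wstring> ls;
  sp.analyze(L"First one. Second", ls);  // " Second" is kept in the buffer
  sp.analyze(L" one. Third", ls);        // completes " Second one."
  sp.flush_buffer(ls);                   // emits the pending " Third"
  return ls;
}
```

The sketch keeps leading whitespace in each chunk for brevity. The point is only the contract: text that does not yet form a complete sentence is held back between calls, and flushing forces it out.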

Analyzer configuration options

Class analyzer::config_options contains the configuration options that define which modules are active and which configuration files are loaded for each of them at construction time. Options in this set cannot be altered once the analyzer is created. If an option has an empty value, the corresponding module will not be created (and thus it will not be possible to call it just by altering invoke_options later).

class analyzer_config_options {
  public:
     /// Language of text to process
     std::wstring Lang;

     /// Tokenizer configuration file
     std::wstring TOK_TokenizerFile;

     /// Splitter configuration file
     std::wstring SPLIT_SplitterFile;

     /// Morphological analyzer options
     std::wstring MACO_Decimal, MACO_Thousand;
     std::wstring MACO_UserMapFile, MACO_LocutionsFile,   MACO_QuantitiesFile,
                  MACO_AffixFile,   MACO_ProbabilityFile, MACO_DictionaryFile, 
                  MACO_NPDataFile,  MACO_PunctuationFile, MACO_CompoundFile;     
     double MACO_ProbabilityThreshold;

     /// Phonetics config file
     std::wstring PHON_PhoneticsFile;

     /// NEC config file
     std::wstring NEC_NECFile;

     /// Sense annotator and WSD config files
     std::wstring SENSE_ConfigFile;
     std::wstring UKB_ConfigFile;

     /// Tagger options
     std::wstring TAGGER_HMMFile;
     std::wstring TAGGER_RelaxFile;
     int TAGGER_RelaxMaxIter;
     double TAGGER_RelaxScaleFactor;
     double TAGGER_RelaxEpsilon;
     bool TAGGER_Retokenize;
     ForceSelectStrategy TAGGER_ForceSelect;

     /// Chart parser config file
     std::wstring PARSER_GrammarFile;

     /// Dependency parsers config files
     std::wstring DEP_TxalaFile;   
     std::wstring DEP_TreelerFile;   

     /// Coreference resolution config file
     std::wstring COREF_CorefFile;

     /// semantic graph extractor config file
     std::wstring SEMGRAPH_SemGraphFile;
};
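The convention that an empty option means "do not create this module" can be illustrated with a small stand-in. This is a sketch, not the FreeLing class: `toy_config_options` copies only a handful of the field names listed above, `modules_to_create` and `make_tagging_config` are hypothetical helpers, and the file paths are placeholders rather than real data files.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in with a few of the fields listed above; it is not the FreeLing
// class, only enough to show the "empty value means no module" convention.
struct toy_config_options {
  std::wstring Lang;
  std::wstring TOK_TokenizerFile;
  std::wstring SPLIT_SplitterFile;
  std::wstring TAGGER_HMMFile;
  std::wstring NEC_NECFile;
  std::wstring DEP_TxalaFile;
};

// List the modules that would be created for a given option set:
// one entry per non-empty configuration file.
std::vector<std::wstring> modules_to_create(const toy_config_options &cfg) {
  std::vector<std::wstring> mods;
  if (!cfg.TOK_TokenizerFile.empty()) mods.push_back(L"tokenizer");
  if (!cfg.SPLIT_SplitterFile.empty()) mods.push_back(L"splitter");
  if (!cfg.TAGGER_HMMFile.empty()) mods.push_back(L"hmm tagger");
  if (!cfg.NEC_NECFile.empty()) mods.push_back(L"nec");
  if (!cfg.DEP_TxalaFile.empty()) mods.push_back(L"txala parser");
  return mods;
}

// Hypothetical option set: the file paths are placeholders, not real files.
toy_config_options make_tagging_config() {
  toy_config_options cfg;
  cfg.Lang = L"en";
  cfg.TOK_TokenizerFile = L"data/en/tokenizer.dat";
  cfg.SPLIT_SplitterFile = L"data/en/splitter.dat";
  cfg.TAGGER_HMMFile = L"data/en/tagger.dat";
  // NEC_NECFile and DEP_TxalaFile stay empty, so those modules are skipped.
  return cfg;
}
```

With this option set only the tokenizer, splitter and HMM tagger stand-ins would be created, so a later request for NEC or dependency parsing could not be honoured.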

Analyzer invocation options

Class analyzer::invoke_options contains the options that define the behaviour of each module in the analyzer for all subsequent analyses (until the invoke options are changed again). Options in this set can be altered after construction (e.g. to activate/deactivate certain modules), as many times as needed.

Values for these options need to be consistent with the configuration options used to create the analyzer (e.g. it is not possible to activate a module that has not been loaded at creation time).

class analyzer_invoke_options {
   public:
     /// Level of analysis in input and output
     AnalysisLevel InputLevel, OutputLevel;

     /// activate/deactivate morphological analyzer modules
     bool MACO_UserMap, MACO_AffixAnalysis, MACO_MultiwordsDetection, 
          MACO_NumbersDetection, MACO_PunctuationDetection, 
          MACO_DatesDetection, MACO_QuantitiesDetection, 
          MACO_DictionarySearch, MACO_ProbabilityAssignment, MACO_CompoundAnalysis,
          MACO_NERecognition, MACO_RetokContractions;

     /// activate/deactivate phonetics
     bool PHON_Phonetics;

     /// activate/deactivate NEC
     bool NEC_NEClassification;

     /// Select which WSD to use (NO_WSD,ALL,MFS,UKB)
     WSDAlgorithm SENSE_WSD_which;

     /// Select which tagger to use (NO_TAGGER,HMM,RELAX)
     TaggerAlgorithm TAGGER_which;

     /// Select which dependency parser to use (NO_DEP,TXALA,TREELER)
     DependencyParser DEP_which;    
};
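The consistency requirement between invocation and configuration options can be sketched as a simple validation step. The text does not describe how the analyzer performs this check (nor exactly what the `check` argument of set_current_invoke_options does), so the code below is only one possible reading: `toy_config`, `toy_invoke`, `check_invoke` and `hmm_only_config` are hypothetical stand-ins that reuse a few documented field and value names.

```cpp
#include <cassert>
#include <string>

// Stand-in enumerations using the value names given in the comments above.
enum toy_tagger { NO_TAGGER, HMM, RELAX };
enum toy_dep { NO_DEP, TXALA, TREELER };

// Stand-ins holding only the fields needed for the check (not FreeLing types).
struct toy_config {
  std::wstring TAGGER_HMMFile, TAGGER_RelaxFile;
  std::wstring DEP_TxalaFile, DEP_TreelerFile;
  std::wstring NEC_NECFile;
};
struct toy_invoke {
  toy_tagger TAGGER_which;
  toy_dep DEP_which;
  bool NEC_NEClassification;
};

// Return an empty string when `opt` only activates modules that `cfg`
// loaded; otherwise return a short description of the first conflict.
std::string check_invoke(const toy_config &cfg, const toy_invoke &opt) {
  if (opt.TAGGER_which == HMM && cfg.TAGGER_HMMFile.empty())
    return "HMM tagger requested but not loaded";
  if (opt.TAGGER_which == RELAX && cfg.TAGGER_RelaxFile.empty())
    return "relax tagger requested but not loaded";
  if (opt.DEP_which == TXALA && cfg.DEP_TxalaFile.empty())
    return "Txala parser requested but not loaded";
  if (opt.DEP_which == TREELER && cfg.DEP_TreelerFile.empty())
    return "Treeler parser requested but not loaded";
  if (opt.NEC_NEClassification && cfg.NEC_NECFile.empty())
    return "NEC requested but not loaded";
  return "";
}

// Configuration that only loaded the HMM tagger (placeholder file name).
toy_config hmm_only_config() {
  toy_config cfg;
  cfg.TAGGER_HMMFile = L"tagger.dat";
  return cfg;
}
```

Under these assumptions, selecting HMM with `hmm_only_config()` passes the check, while selecting RELAX or TXALA is reported as a conflict because the corresponding file option was left empty at creation time.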