Using the analyzer Program to Process Corpora

The simplest way to use the FreeLing libraries is via the provided analyzer main program, which allows the user to process an input text and obtain several levels of linguistic analysis.

Since it is impossible to write a program that fits everyone's needs, analyzer offers almost all the functionality included in FreeLing. However, if you want it to output more information, produce a specific format, or combine the modules in a different way, the right path to follow is building your own main program or adapting an existing one, as described in section Using the library from your own application.

The analyzer program is usually called with an option -f config-file (if omitted, it will search for a file named analyzer.cfg in the current directory). The given config-file must be an absolute file name or a path relative to the current directory.

You can use the default configuration files (located at /usr/local/share/freeling/config if you installed from tarball, or at /usr/share/freeling/config if you used a .deb package), or create a configuration file that suits your needs. Note that the default configuration files require the environment variable FREELINGSHARE to be defined and to point to a directory with valid FreeLing data files (e.g. /usr/local/share/freeling).

Environment variables are used for flexibility (e.g. to avoid having to modify configuration files if you relocate your data files), but if you don't need them, you can replace all occurrences of FREELINGSHARE in your configuration files with a static path.
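
For example, with a default tarball installation you could define the variable and call the analyzer directly:

export FREELINGSHARE=/usr/local/share/freeling
analyzer -f /usr/local/share/freeling/config/en.cfg <myinput >myoutput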

The analyzer program also provides a server mode (use option --server) which expects its input from a socket. The program analyzer_client can be used to read input files and send requests to the server. The advantage is that the server remains loaded after analyzing each client's request, thus reducing the start-up overhead when many small files have to be processed. Client and server communicate via sockets. The client-server approach is also a good strategy to call FreeLing from a language or platform for which no API is provided: just launch a server and use your preferred language to program a client that behaves like analyzer_client.
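
For illustration, here is a minimal client sketch in Python (a hypothetical example, not part of FreeLing). It assumes the simplest possible exchange: send the raw text, then read the reply until the server closes the connection. The exact framing used by analyzer_client may differ, so check its source in the FreeLing package before relying on this sketch.

import socket
import sys

# Hypothetical minimal client for a server launched with:
#   analyze -f en.cfg --server --port 50005 &
host, port = "localhost", 50005

with open(sys.argv[1], "rb") as f:
    text = f.read()

with socket.create_connection((host, port)) as sock:
    sock.sendall(text)
    sock.shutdown(socket.SHUT_WR)      # signal end of input
    reply = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:                  # server closed the connection
            break
        reply += chunk

sys.stdout.write(reply.decode("utf-8"))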

The analyze (no final "r") script described below handles all these default paths and variables and makes everything easier if you want to use the defaults.

The easy way: Using the analyze script

To ease the invocation of the program, a script named analyze (no final "r") is provided. This script is able to locate default configuration files, define library search paths, and choose between the client-server and the stand-alone version.

The sample main program is called with the command:

analyze [-f config-file] [options]

If -f config-file is not specified, a file named analyzer.cfg is searched in the current working directory.

If -f config-file is specified but not found in the current directory, it will be searched for in the FreeLing installation directory, which is one of:

  • /usr/local/share/freeling/config if you installed from source
  • /usr/share/freeling/config if you used a binary .deb package.
  • myfreeling/share/freeling/config if you used --prefix=myfreeling option with ./configure.

Extra options may be specified in the command line to override any settings in config-file. See section Valid Options.
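
For example, to keep all other settings from en.cfg but stop the analysis at the morphological level, you can override its OutputLevel setting from the command line:

analyze -f en.cfg --outlv morfo <myinput >myoutput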

Stand-alone mode

The default mode will launch a stand-alone analyzer, which will load the configuration, read input from stdin, write results to stdout, and exit. E.g.:

analyze -f en.cfg <myinput >myoutput

When the input file ends, the analyzer stops, and it has to be launched again (reloading all its data files) to process a new file.
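
Because of this start-up cost, processing many small files one by one in stand-alone mode is slow. For example, a loop such as:

for f in *.txt; do
  analyze -f en.cfg <"$f" >"${f%.txt}.mrf"
done

reloads all the data files once per input file; the client/server mode described next avoids this.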

Client/server mode

If --server and --port options are specified, a server will be launched which starts listening for incoming requests. E.g.:

analyze -f en.cfg --server --port 50005 &

Once the server is launched, clients can send analysis requests to it, with:

analyzer_client 50005 <myinput >myoutput
analyzer_client localhost:50005 <myinput >myoutput

or, from a remote machine:

analyzer_client my.server.com:50005 <myinput >myoutput
analyzer_client 192.168.10.11:50005 <myinput >myoutput

The server will fork a new process to attend each new client, so you can have many clients being served at the same time.

You can control the maximum number of clients attended simultaneously (to prevent your server from being flooded) with the option --workers. You can control the size of the queue of pending clients with option --queue. Clients trying to connect when the queue is full will receive a connection error. See section Valid Options for details on these options.
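
For example, to serve at most 10 simultaneous clients and queue up to 64 pending connections:

analyze -f en.cfg --server --port 50005 --workers 10 --queue 64 &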

Using a threaded analyzer

If libboost_thread is installed, the installation process will build the program threaded_analyzer. This program behaves like analyzer, and has almost the same options.

The program threaded_analyzer launches each processor in a separate thread, so while one sentence is being parsed, the next is being tagged, and the following one is running through the morphological analyzer. In this way, the multi-core capabilities of the host are better exploited and the analyzer runs faster.

Although it is intended mainly as an example for developers wanting to build their own threaded applications, this program can also be used to analyze texts, in the same way as analyzer.

Nevertheless, notice that this example program does not include modules that are not token- or sentence-oriented, namely, language identification and coreference resolution.

Usage example

Assuming we have the following input file mytext.txt:

El gato come pescado. Pero a Don Jaime no le gustan los gatos.

we could issue the command:

analyze -f myconfig.cfg <mytext.txt >mytext.mrf

Assuming that myconfig.cfg is the file presented in section Sample Configuration File, the produced output would correspond to a morfo output level (i.e. morphological analysis but no PoS tagging). The expected results are:

 El el DA0MS0 1 
 gato gato NCMS000 1 
 come comer VMIP3S0 0.75 comer VMM02S0 0.25 
 pescado pescado NCMS000 0.833333 pescar VMP00SM 0.166667 
 . . Fp 1 

 Pero pero CC 0.99878 pero NCMS000 0.00121951 Pero NP00000 0.00121951 
 a a NCFS000 0.0054008 a SPS00 0.994599 
 Don_Jaime Don_Jaime NP00000 1 
 no no NCMS000 0.00231911 no RN 0.997681 
 le él PP3CSD00 1 
 gustan gustar VMIP3P0 1 
 los el DA0MP0 0.975719 lo NCMP000 0.00019425 él PP3MPA00 0.024087 
 gatos gato NCMP000 1 
 . . Fp 1

If we also wanted PoS tagging, we could have issued the command:

analyze -f myconfig.cfg --outlv tagged <mytext.txt >mytext.tag

to obtain the tagged output:

 El el DA0MS0
 gato gato NCMS000
 come comer VMIP3S0
 pescado pescado NCMS000
 . . Fp

 Pero pero CC
 a a SPS00
 Don_Jaime Don_Jaime NP00000
 no no RN
 le él PP3CSD00
 gustan gustar VMIP3P0
 los el DA0MP0
 gatos gato NCMP000
 . . Fp

We can also ask for the senses of the tagged words:

analyze -f myconfig.cfg --outlv tagged --sense all <mytext.txt >mytext.sen

obtaining the output:

 El el DA0MS0
 gato gato NCMS000 01630731:07221232:01631653
 come comer VMIP3S0 00794578:00793267
 pescado pescado NCMS000 05810856:02006311
 . . Fp

 Pero pero CC
 a a SPS00
 Don_Jaime Don_Jaime NP00000
 no no RN
 le él PP3CSD00
 gustan gustar VMIP3P0 01244897:01213391:01241953
 los el DA0MP0
 gatos gato NCMP000 01630731:07221232:01631653
 . . Fp

Alternatively, if we don't want to repeat the first steps that we had already performed, we could use the output of the morphological analyzer as input to the tagger:

analyze -f myconfig.cfg --inplv morfo --outlv tagged <mytext.mrf >mytext.tag

See options InputLevel, OutputLevel, InputFormat, and OutputFormat in section Valid options for details on the valid input and output levels and formats.

Configuration File and Command Line Options

Almost all options may be specified either in the configuration file or on the command line, with the latter taking precedence over the former.

Valid options are presented in section Valid options, both in their command-line and configuration file notations. Configuration files follow the usual Linux standards. A sample file may be seen in section Sample Configuration File.

The FreeLing package includes default configuration files. They can be found in the directory share/freeling/config under the FreeLing installation prefix (/usr/local if you installed from source, /usr if you used a binary .deb package). The analyze script will try to locate the configuration file in that directory if it is not found in the current working directory.


Valid Options

This section presents the options that can be given to the analyzer program (and thus, also to the analyzer_server program and to the analyze script). All options can be written in the configuration file as well as on the command line. The latter always takes precedence over the former.


Help
Command line Configuration file
-h, --help, --help-cf N/A

Prints to stdout a help screen with valid options and exits.
--help provides information about command line options.
--help-cf provides information about configuration file options.


Version number
Command line Configuration file
-v, --version N/A

Prints the version number of the currently installed FreeLing library.


Configuration file
Command line Configuration file
-f <filename> N/A

Specify the configuration file to use (default: analyzer.cfg in the current directory).


Server mode
Command line Configuration file
--server ServerMode=(yes/y/on/no/n/off)

Activate server mode. Requires that option --port is also provided.
Default value is off.


Server Port Number
Command line Configuration file
-p <int>, --port <int> ServerPort=<int>

Specify port where server will be listening for requests. This option must be specified if server mode is active, and it is ignored if server mode is off.


Maximum Number of Server Workers
Command line Configuration file
-w <int>, --workers <int> ServerMaxWorkers=<int>

Specify maximum number of active workers that the server will launch. Each worker attends a client, so this is the maximum number of clients that are simultaneously attended. This option is ignored if server mode is off.

Default value is 5. Note that a high number of simultaneous workers will result in forking that many processes, which may overload the CPU and memory of your machine, resulting in a system collapse.

When the maximum number of workers is reached, new incoming requests are queued until a worker finishes.


Maximum Size of Server Queue
Command line Configuration file
-q <int>, --queue <int> ServerQueueSize=<int>

Specify maximum number of pending clients that the server socket can hold. This option is ignored if server mode is off.

Pending clients are requests waiting for a worker to be available. They are queued in the operating system socket queue.

Default value is 32. Note that the operating system has an internal limit for the socket queue size (e.g. modern Linux kernels set it to 128). If the given value is higher than the operating system limit, it will be ignored.

When the pending queue is full, new incoming requests get a connection error.


Trace Level
Command line Configuration file
-l <int>, --tlevel <int> TraceLevel=<int>

Set the trace level (0 = no trace, higher values = more trace), for debugging purposes.

This will work only if the library was compiled with tracing information, using ./configure --enable-traces. Note that the code with tracing information is slower than the code compiled without it, even when traces are not active.


Trace Module
Command line Configuration file
-m <mask>, --tmod <mask> TraceModule=<mask>

Specify modules to trace. Each module is identified with a hexadecimal flag. All flags may be OR-ed to specify the set of modules to be traced (an example follows the table below).

Valid masks are defined in file src/include/freeling/morfo/traces.h, and are the following:

Module Mask
Splitter 0x00000001
Tokenizer 0x00000002
Morphological analyzer 0x00000004
Language Identifier 0x00000008
Numbers detection 0x00000010
Date/time detection 0x00000020
Punctuation 0x00000040
Dictionary 0x00000080
Affixes 0x00000100
Multiwords 0x00000200
NE Recognition 0x00000400
Probabilities 0x00000800
Quantities detection 0x00001000
NE Classification 0x00002000
Automat (abstract) 0x00004000
PoS Tagger 0x00008000
Sense annotation 0x00010000
Chart parser 0x00020000
Chart grammar 0x00040000
Dependency parser 0x00080000
Coreference resolution 0x00100000
Basic utilities 0x00200000
WSD 0x00400000
Alternatives 0x00800000
Database access 0x01000000
Feature Extraction 0x02000000
Machine Learning modules 0x04000000
Phonetic encoding 0x08000000
Mention detection 0x10000000
Input/Output 0x20000000
Semantic graph extraction 0x40000000
Summarizer 0x80000000
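
For example, to trace the tokenizer (0x00000002) and the splitter (0x00000001) at the same time, OR both flags into a single mask. This requires a library compiled with --enable-traces; the trace level 4 below is just an arbitrary non-zero value:

analyze -f en.cfg --tlevel 4 --tmod 0x00000003 <myinput >myoutput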

Language of input text
Command line Configuration file
--lang <language> Lang=<language>

Code for language of input text. Though it is not required, the convention is to use two-letter ISO codes (as: Asturian, es: Spanish, ca: Catalan, en: English, cy: Welsh, it: Italian, gl: Galician, pt: Portuguese, ru: Russian, old-es: old Spanish, etc).

Other languages may be added to the library. See chapter Adding Support for New Languages for details.


Locale
Command line Configuration file
--locale <locale> Locale=<locale>

Locale to be used to interpret both input text and data files. Usually, the value will match the locale of the Lang option (e.g. es_ES.utf8 for Spanish, ca_ES.utf8 for Catalan, etc.). The values default (which stands for en_US.utf8) and system (which stands for the currently active system locale) may also be used.


Splitter Buffer Flushing
Command line Configuration file
--flush, --noflush AlwaysFlush=(yes/y/on/no/n/off)

When this option is inactive (the most usual choice) the sentence splitter buffers lines until a sentence marker is found. Then, it outputs a complete sentence.

When this option is active, the splitter never buffers any token, and considers each newline as a sentence end, thus processing each line as an independent sentence.
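
For instance, if the input is already segmented one sentence per line, you can activate this behaviour from the command line:

analyze -f en.cfg --flush <myinput >myoutput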


Input Format
Command line Configuration file
--input <string> InputFormat=<string>

Input format in which to expect text to analyze.

Valid values are:
  • text: Plain text.
  • freeling: pseudo-column format produced by FreeLing with output level morfo or tagged.
  • conll: CoNLL-like column format.


Input CoNLL format definition file
Command line Configuration file
--iconll <filename> InputConllConfig=<filename>

Configuration file for input CoNLL format. Defines which columns must be read, and in which order. See section Input/Output Handling Modules for details on the file format.

This option is valid only when InputFormat=conll. Otherwise, it is ignored.


Output Format
Command line Configuration file
--output <string> OutputFormat=<string>

Output format to produce with analysis results.

Valid values are:

  • freeling: Classical FreeLing format. It may be a pseudo-column format with output levels morfo or tagged, parenthesized trees for parsing output, or other human-readable output for coreference or semantic graph output.
  • conll: CoNLL-like column format.
  • xml: FreeLing-specific XML format.
  • json: JSON format.
  • naf: XML format following NAF conventions (see https://github.com/newsreader/NAF).
  • train: Produce FreeLing pseudo-column format suitable to train PoS taggers. This option can be used to annotate a corpus, correct the output manually, and use it to retrain the taggers with the script src/utilities/train-tagger/bin/TRAIN.sh provided in the FreeLing package. See src/utilities/train-tagger/README for details about how to use it.
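
For example, to obtain the tagged analysis of the usage example above as JSON instead of the default pseudo-column format:

analyze -f myconfig.cfg --outlv tagged --output json <mytext.txt >mytext.json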

Output CoNLL format definition file
Command line Configuration file
--oconll <filename> OutputConllConfig=<filename>

Configuration file for output CoNLL format. Defines which columns must be written, and in which order. See section Input/Output Handling Modules for details on the file format.

This option is valid only when OutputFormat=conll. Otherwise, it is ignored.


Input Level
Command line Configuration file
--inplv <string> InputLevel=<string>

Analysis level of input data (plain, token, splitted, morfo, tagged, shallow, dep, coref).

  • plain: plain text.
  • token: tokenized text (one token per line).
  • splitted: tokenized and sentence-splitted text (one token per line, sentences separated with one blank line).
  • morfo: tokenized, sentence-splitted, and morphologically analyzed text. One token per line, sentences separated with one blank line. Each line has the format: word (lemma tag prob)+
  • tagged: tokenized, sentence-splitted, morphologically analyzed, and PoS-tagged text. One token per line, sentences separated with one blank line. Each line has the format: word lemma tag.
  • shallow: the previous plus constituency parsing. Only valid with InputFormat=conll.
  • dep: the previous plus dependency parsing (may include constituents or not. May include also SRL). Only valid with InputFormat=conll.
  • coref: the previous plus coreference. Only valid with InputFormat=conll.

Output Level
Command line Configuration file
--outlv <string> OutputLevel=<string>

Analysis level of output data (ident, token, splitted, morfo, tagged, shallow, parsed, dep, coref, semgraph).

  • ident: perform language identification instead of analysis.
  • token: tokenized text (one token per line).
  • splitted: tokenized and sentence-splitted text (one token per line, sentences separated with one blank line).
  • morfo: tokenized, sentence-splitted, and morphologically analyzed text. One token per line, sentences separated with one blank line.
  • tagged: tokenized, sentence-splitted, morphologically analyzed, and PoS-tagged text. One token per line, sentences separated with one blank line.
  • shallow: tokenized, sentence-splitted, morphologically analyzed, PoS-tagged, optionally sense-annotated, and shallow-parsed text, produced by the chart_parser module.
  • parsed: tokenized, sentence-splitted, morphologically analyzed, PoS-tagged, optionally sense-annotated, and full-parsed text, as output by the first stage (tree completion) of the rule-based dependency parser.
  • dep: tokenized, sentence-splitted, morphologically analyzed, PoS-tagged, optionally sense-annotated, and dependency-parsed text, as output by the second stage (transformation to dependencies and function labelling) of the dependency parser. May include also SRL if the statistical parser is used (and SRL is available for the input language).
  • coref: the previous plus coreference.
  • semgraph: the previous plus semantic graph. Only valid with OutputFormat=xml|json|freeling.

Language Identification Configuration File
Command line Configuration file
-I <filename>, --fidn <filename> N/A

Configuration file for language identifier.


Tokenizer File
Command line Configuration file
--ftok <filename> TokenizerFile=<filename>

File of tokenization rules.


Splitter File
Command line Configuration file
--fsplit <filename> SplitterFile=<filename>

File of splitter rules.


Affix Analysis
Command line Configuration file
--afx, --noafx AffixAnalysis=(yes/y/on/no/n/off)

Whether to perform affix analysis on unknown words. Affix analysis applies a set of affixation rules to the word to check whether it is a derived form of a known word.


Affixation Rules File
Command line Configuration file
-S <filename>, --fafx <filename> AffixFile=<filename>

Affix rules file, used by dictionary module.


User Map
Command line Configuration file
--usr, --nousr UserMap=(yes/y/on/no/n/off)

Whether or not to apply a file of customized word-tag mappings.


User Map File
Command line Configuration file
-M <filename>, --fmap <filename> UserMapFile=<filename>

User Map file to be used.


Multiword Detection
Command line Configuration file
--loc, --noloc MultiwordsDetection=(yes/y/on/no/n/off)

Whether to perform multiword detection. This option requires that a multiword file is provided.


Multiword File
Command line Configuration file
-L <filename>, --floc <filename> LocutionsFile=<filename>

Multiword definition file.


Number Detection
Command line Configuration file
--numb, --nonumb NumbersDetection=(yes/y/on/no/n/off)

Whether to perform numerical expression detection. Deactivating this feature will affect the behaviour of the date/time and ratio/currency detection modules.


Decimal Point
Command line Configuration file
--dec <string> DecimalPoint=<string>

Specify the decimal point character for the number detection module (for instance, in English it is a dot, but in Spanish it is a comma).


Thousand Point
Command line Configuration file
--thou <string> ThousandPoint=<string>

Specify the thousand point character for the number detection module (for instance, in English it is a comma, but in Spanish it is a dot).


Punctuation Detection
Command line Configuration file
--punt, --nopunt PunctuationDetection=(yes/y/on/no/n/off)

Whether to assign a PoS tag to punctuation signs.


Punctuation Detection File
Command line Configuration file
-F <filename>, --fpunct <filename> PunctuationFile=<filename>

Punctuation symbols file.


Date Detection
Command line Configuration file
--date, --nodate DatesDetection=(yes/y/on/no/n/off)

Whether to perform date and time expression detection.


Quantities Detection
Command line Configuration file
--quant, --noquant QuantitiesDetection=(yes/y/on/no/n/off)

Whether to perform detection of currency amounts, physical magnitudes, and ratios.


Quantity Recognition File
Command line Configuration file
-Q <filename>, --fqty <filename> QuantitiesFile=<filename>

Quantity recognition configuration file.


Dictionary Search
Command line Configuration file
--dict, --nodict DictionarySearch=(yes/y/on/no/n/off)

Whether to search word forms in the dictionary. Deactivating this feature also deactivates the AffixAnalysis option.


Dictionary File
Command line Configuration file
-D <filename>, --fdict <filename> DictionaryFile=<filename>

Dictionary database.


Probability Assignment
Command line Configuration file
--prob, --noprob ProbabilityAssignment=(yes/y/on/no/n/off)

Whether to compute a lexical probability for each tag of each word. Deactivating this feature will affect the behaviour of the PoS tagger.


Lexical Probabilities File
Command line Configuration file
-P <filename>, --fprob <filename> ProbabilityFile=<filename>

Lexical probabilities file. The probabilities in this file are used to compute the most likely tag for a word, as well as to estimate the likely tags for unknown words.


Unknown Words Probability Threshold.
Command line Configuration file
-e <float>, --thres <float> ProbabilityThreshold=<float>

Threshold that must be reached by the probability of a tag given the suffix of an unknown word in order to be included in the list of possible tags for that word. Default is zero (all tags are included in the list). A non-zero value (e.g. 0.0001, 0.001) is recommended.


Named Entity Recognition
Command line Configuration file
--ner, --noner NERecognition=(yes/y/on/no/n/off)

Whether to perform NE recognition.


Named Entity Recognizer File
Command line Configuration file
-N <filename>, --fnp <filename> NPDataFile=<filename>

Configuration data file for NE recognizer.


Named Entity Classification
Command line Configuration file
--nec, --nonec NEClassification=(yes/y/on/no/n/off)

Whether to perform NE classification.


Named Entity Classifier File
Command line Configuration file
--fnec <filename> NECFile=<filename>

Configuration file for the Named Entity Classifier module.


Phonetic Encoding
Command line Configuration file
--phon, --nophon Phonetics=(yes/y/on/no/n/off)

Whether to add phonetic transcription to each word.


Phonetic Encoder File
Command line Configuration file
--fphon <filename> PhoneticsFile=<filename>

Configuration file for the phonetic encoding module.


Sense Annotation
Command line Configuration file
-s <string>, --sense <string> SenseAnnotation=<string>

Kind of sense annotation to perform:

  • no, none: Deactivate sense annotation.
  • all: annotate with all possible senses in sense dictionary.
  • mfs: annotate with most frequent sense.
  • ukb: annotate all senses, ranked by UKB algorithm.

If the annotation is active, the PoS tag selected by the tagger for each word is enriched with a list of all its possible WN synsets. The sense repository used depends on the options "Sense Annotation Configuration File" and "UKB Word Sense Disambiguator Configuration File" described below.
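
For example, to annotate the tagged words with senses ranked by UKB (this relies on the UKB configuration file described below):

analyze -f myconfig.cfg --outlv tagged --sense ukb <mytext.txt >mytext.sen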


Sense Annotation Configuration File
Command line Configuration file
-W <filename>, --fsense <filename> SenseConfigFile=<filename>

Word sense annotator configuration file.


UKB Word Sense Disambiguator Configuration File
Command line Configuration file
-U <filename>, --fukb <filename> UKBConfigFile=<filename>

UKB configuration file.


Tagger algorithm
Command line Configuration file
-t <string>, --tag <string> Tagger=<string>

Algorithm to use for PoS tagging:

  • hmm: Hidden Markov Model tagger, based on [Bra00].
  • relax: Relaxation Labelling tagger, based on [Pad98].
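
For example, to override the hmm tagger selected in the sample configuration file and use the relaxation labelling tagger instead (assuming FREELINGSHARE is defined, as discussed above):

analyze -f myconfig.cfg --tag relax --rlx $FREELINGSHARE/es/constr_gram-B.dat <mytext.txt >mytext.tag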

HMM Tagger configuration File
Command line Configuration file
-H <filename>, --hmm <filename> TaggerHMMFile=<filename>

Parameters file for HMM tagger.


Relaxation labelling tagger constraints file
Command line Configuration file
-R <filename>, --rlx <filename> TaggerRelaxFile=<filename>

File containing the constraints to apply to solve the PoS tagging.


Relaxation labelling tagger iteration limit
Command line Configuration file
-i <int>, --iter <int> TaggerRelaxMaxIter=<int>

Maximum number of iterations to perform in case relaxation does not converge.


Relaxation labelling tagger scale factor
Command line Configuration file
-r <float>, --sf <float> TaggerRelaxScaleFactor=<float>

Scale factor used to normalize supports inside the RL algorithm. It is comparable to the step length in a hill-climbing algorithm: the larger the scale factor, the smaller the step.


Relaxation labelling tagger epsilon value
Command line Configuration file
--eps <float> TaggerRelaxEpsilon=<float>

Real value used to determine when a relaxation labelling iteration has produced no significant changes. The algorithm stops when no weight has changed above the specified epsilon.


Retokenize contractions in dictionary
Command line Configuration file
--rtkcon, --nortkcon RetokContractions=(yes/y/on/no/n/off)

Specifies whether the dictionary must retokenize contractions when found, or leave the decision to the TaggerRetokenize option.

Note that if this option is active, contractions will be retokenized even if the TaggerRetokenize option is not active. If this option is not active, contractions will be retokenized depending on the value of the TaggerRetokenize option.


Retokenize after tagging
Command line Configuration file
--rtk, --nortk TaggerRetokenize=(yes/y/on/no/n/off)

Determine whether the tagger must perform retokenization after the appropriate analysis has been selected for each word. This is closely related to affix analysis and PoS taggers.


Force the selection of one unique tag
Command line Configuration file
--force <string> TaggerForceSelect=(none/tagger/retok)

Determine whether the tagger must be forced to (probably randomly) make a unique choice and when.

  • none: Do not force the tagger, allow ambiguous output.
  • tagger: Force the tagger to choose before retokenization (i.e. if retokenization introduces any ambiguity, it will be present in the final output).
  • retok: Force the tagger to choose after retokenization (no remaining ambiguity).

Chart Parser Grammar File
Command line Configuration file
-G <filename>, --grammar <filename> GrammarFile=<filename>

This file contains a CFG grammar for the chart parser, and some directives to control which chart edges are selected to build the final tree.


Dependency Parser Rule File
Command line Configuration file
-T <filename>, --txala <filename> DepTxalaFile=<filename>

Rules to be used to perform rule-based dependency analysis.


Statistical Dependency Parser File
Command line Configuration file
-E <filename>, --treeler <filename> DepTreelerFile=<filename>

Configuration file for the statistical dependency parser and SRL module.


Dependency Parser Selection
Command line Configuration file
-d <string>, --dep <string> DependencyParser=<string>

Which dependency parser to use. Valid values are:

  • txala: rule-based parser.
  • treeler: statistical parser (may also perform SRL).
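
For example, to obtain dependency trees from the statistical parser:

analyze -f myconfig.cfg --outlv dep --dep treeler <mytext.txt >mytext.dep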


Coreference Resolution File
Command line Configuration file
-C <filename>, --fcorf <filename> CorefFile=<filename>

Configuration file for coreference resolution module.

Sample Configuration File

A sample configuration file follows. You can start using FreeLing with the default configuration files, which are installed at /usr/local/share/freeling/config (the /usr/local prefix may differ if you specified an alternative location when installing FreeLing). If you installed from a binary .deb package, they will be at /usr/share/freeling/config.

You can use those files as a starting point to customize one configuration file to suit your needs.

Note that file paths in the sample configuration file contain $FREELINGSHARE, which is supposed to be an environment variable. If this variable is not defined, the analyzer will abort, complaining about not finding the files.

If you use the analyze script, it will define the variable for you as /usr/local/share/freeling (or the right installation path), unless you define it to point somewhere else.

You can also adjust your configuration files to use normal paths for the files (either relative or absolute) instead of using variables.
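
For example, a one-line substitution over a default configuration file does the job (a sketch; adjust the source file and the path to your installation):

sed 's|\$FREELINGSHARE|/usr/local/share/freeling|g' es.cfg > myconfig.cfg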

##
#### default configuration file for Spanish analyzer
##

#### General options 
Lang=es
Locale=default

### Tagset description file, used by different modules
TagsetFile=$FREELINGSHARE/es/tagset.dat

## Traces (deactivated)
TraceLevel=0
TraceModule=0x0000

## Options to control the applied modules. The input may be partially
## processed, or a full analysis may not be wanted. The specific
## formats are a choice of the main program using the library, as well
## as the responsibility of calling only the required modules.
InputLevel=text
OutputLevel=morfo

# Do not consider each newline as a sentence end
AlwaysFlush=no

#### Tokenizer options
TokenizerFile=$FREELINGSHARE/es/tokenizer.dat

#### Splitter options
SplitterFile=$FREELINGSHARE/es/splitter.dat

#### Morfo options
AffixAnalysis=yes
CompoundAnalysis=yes
MultiwordsDetection=yes
NumbersDetection=yes
PunctuationDetection=yes
DatesDetection=yes
QuantitiesDetection=yes
DictionarySearch=yes
ProbabilityAssignment=yes
DecimalPoint=,
ThousandPoint=.
LocutionsFile=$FREELINGSHARE/es/locucions.dat 
QuantitiesFile=$FREELINGSHARE/es/quantities.dat
AffixFile=$FREELINGSHARE/es/afixos.dat
CompoundFile=$FREELINGSHARE/es/compounds.dat
ProbabilityFile=$FREELINGSHARE/es/probabilitats.dat
DictionaryFile=$FREELINGSHARE/es/dicc.src
PunctuationFile=$FREELINGSHARE/common/punct.dat
ProbabilityThreshold=0.001

# NER options 
NERecognition=yes
NPDataFile=$FREELINGSHARE/es/np.dat
## comment the line above and uncomment one of those below if you want
## a better NE recognizer (higher accuracy, lower speed)
#NPDataFile=$FREELINGSHARE/es/nerc/ner/ner-ab-poor1.dat
#NPDataFile=$FREELINGSHARE/es/nerc/ner/ner-ab-rich.dat
# "rich" model is trained with rich gazetteer. Offers higher accuracy but 
# requires adapting gazetteer files to have high coverage on target corpus.
# "poor1" model is trained with poor gazetteer. Accuracy is splightly lower
# but suffers small accuracy loss the gazetteer has low coverage in target corpus.
# If in doubt, use "poor1" model.

## Phonetic encoding of words.
Phonetics=no
PhoneticsFile=$FREELINGSHARE/es/phonetics.dat

## NEC options. See README in common/nec
NEClassification=no
NECFile=$FREELINGSHARE/es/nerc/nec/nec-ab-poor1.dat
#NECFile=$FREELINGSHARE/es/nerc/nec/nec-ab-rich.dat

## Sense annotation options (none,all,mfs,ukb)
SenseAnnotation=none
SenseConfigFile=$FREELINGSHARE/es/senses.dat
UKBConfigFile=$FREELINGSHARE/es/ukb.dat

#### Tagger options
Tagger=hmm
TaggerHMMFile=$FREELINGSHARE/es/tagger.dat
TaggerRelaxFile=$FREELINGSHARE/es/constr_gram-B.dat
TaggerRelaxMaxIter=500
TaggerRelaxScaleFactor=670.0
TaggerRelaxEpsilon=0.001
TaggerRetokenize=yes
TaggerForceSelect=tagger

#### Parser options
GrammarFile=$FREELINGSHARE/es/chunker/grammar-chunk.dat

#### Dependency Parser options
DependencyParser=txala
DepTxalaFile=$FREELINGSHARE/es/dep_txala/dependences.dat
DepTreelerFile=$FREELINGSHARE/es/dep_treeler/dependences.dat

#### Coreference Solver options
CorefFile=$FREELINGSHARE/es/coref/relaxcor/relaxcor.dat
SemGraphExtractorFile=$FREELINGSHARE/es/semgraph/semgraph-SRL.dat