Using the analyzer Program to Process Corpora

The simplest way to use the FreeLing libraries is via the provided analyzer main program, which allows the user to process an input text and obtain several levels of linguistic analysis.

Since it is impossible to write a program that fits everyone's needs, analyzer offers almost all the functionality included in FreeLing. However, if you want it to output more information, produce a specific format, or combine the modules in a different way, the right path to follow is building your own main program or adapting an existing one, as described in section Using the library from your own application.

The analyzer program is usually called with an option -f config-file (if omitted, it will search for a file named analyzer.cfg in the current directory). The given config-file must be an absolute file name or a path relative to the current directory.

You can use the default configuration files (located at /usr/local/share/freeling/config if you installed from tarball, or at /usr/share/freeling/config if you used a .deb package), or create a configuration file that suits your needs. Note that the default configuration files require the environment variable FREELINGSHARE to be defined and to point to a directory with valid FreeLing data files (e.g. /usr/local/share/freeling).

Environment variables are used for flexibility (e.g. to avoid having to modify configuration files if you relocate your data files), but if you don't need them, you can replace all occurrences of FREELINGSHARE in your configuration files with a static path.
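
For example, with a default tarball installation you could define the variable and call the analyzer directly:

export FREELINGSHARE=/usr/local/share/freeling
analyzer -f /usr/local/share/freeling/config/en.cfg <myinput >myoutput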

The analyzer program also provides a server mode (use option --server) which expects its input from a socket. The program analyzer_client can be used to read input files and send requests to the server. The advantage is that the server remains loaded after analyzing each client's request, thus reducing the start-up overhead when many small files have to be processed. Client and server communicate via sockets. The client-server approach is also a good strategy to call FreeLing from a language or platform for which no API is provided: just launch a server and use your preferred language to program a client that behaves like analyzer_client.
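
For illustration, here is a minimal client sketch in Python (a hypothetical example, not part of FreeLing). It assumes the simplest possible exchange: send the raw text, then read the reply until the server closes the connection. The exact framing used by analyzer_client may differ, so check its source in the FreeLing package before relying on this sketch.

import socket
import sys

# Hypothetical minimal client for a server launched with:
#   analyze -f en.cfg --server --port 50005 &
host, port = "localhost", 50005

with open(sys.argv[1], "rb") as f:
    text = f.read()

with socket.create_connection((host, port)) as sock:
    sock.sendall(text)
    sock.shutdown(socket.SHUT_WR)      # signal end of input
    reply = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:                  # server closed the connection
            break
        reply += chunk

sys.stdout.write(reply.decode("utf-8"))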

The analyze (no final "r") script described below handles all these default paths and variables and makes everything easier if you want to use the defaults.

The easy way: Using the analyze script

To ease the invocation of the program, a script named analyze (no final "r") is provided. This script is able to locate default configuration files, define library search paths, and choose between the client-server and the stand-alone version.

The sample main program is called with the command:

analyze [-f config-file] [options]

If -f config-file is not specified, a file named analyzer.cfg is searched in the current working directory.

If -f config-file is specified but not found in the current directory, it will be searched for in the FreeLing installation directory, which is one of:

  • /usr/local/share/freeling/config if you installed from source
  • /usr/share/freeling/config if you used a binary .deb package.
  • myfreeling/share/freeling/config if you used --prefix=myfreeling option with ./configure.

Extra options may be specified in the command line to override any settings in config-file. See section Valid Options.
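
For example, to keep all other settings from en.cfg but stop the analysis at the morphological level, you can override its OutputLevel setting from the command line:

analyze -f en.cfg --outlv morfo <myinput >myoutput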

Stand-alone mode

The default mode will launch a stand-alone analyzer, which will load the configuration, read input from stdin, write results to stdout, and exit. E.g.:

analyze -f en.cfg <myinput >myoutput

When the input file ends, the analyzer stops, and it has to be launched again (reloading all its data files) to process a new file.
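
Because of this start-up cost, processing many small files one by one in stand-alone mode is slow. For example, a loop such as:

for f in *.txt; do
  analyze -f en.cfg <"$f" >"${f%.txt}.mrf"
done

reloads all the data files once per input file; the client/server mode described next avoids this.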

Client/server mode

If --server and --port options are specified, a server will be launched which starts listening for incoming requests. E.g.:

analyze -f en.cfg --server --port 50005 &

Once the server is launched, clients can send analysis requests to it, with:

analyzer_client 50005 <myinput >myoutput
analyzer_client localhost:50005 <myinput >myoutput

or, from a remote machine:

analyzer_client my.server.com:50005 <myinput >myoutput
analyzer_client 192.168.10.11:50005 <myinput >myoutput

The server will fork a new process to attend each new client, so you can have many clients being served at the same time.

You can control the maximum number of clients attended simultaneously (to prevent your server from being flooded) with the option --workers. You can control the size of the queue of pending clients with option --queue. Clients trying to connect when the queue is full will receive a connection error. See section Valid Options for details on these options.
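
For example, to serve at most 10 simultaneous clients and queue up to 64 pending connections:

analyze -f en.cfg --server --port 50005 --workers 10 --queue 64 &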

Using a threaded analyzer

If libboost_thread is installed, the installation process will build the program threaded_analyzer. This program behaves like analyzer, and has almost the same options.

The program threaded_analyzer launches each processor in a separate thread, so while one sentence is being parsed, the next is being tagged, and the following one is running through the morphological analyzer. In this way, the multi-core capabilities of the host are better exploited and the analyzer runs faster.

Although it is intended mainly as an example for developers wanting to build their own threaded applications, this program can also be used to analyze texts, in the same way as analyzer.

Nevertheless, notice that this example program does not include modules that are not token- or sentence-oriented, namely, language identification and coreference resolution.

Usage example

Assuming we have the following input file mytext.txt:

El gato come pescado. Pero a Don Jaime no le gustan los gatos.

we could issue the command:

analyze -f myconfig.cfg <mytext.txt >mytext.mrf

Assuming that myconfig.cfg is the file presented in section Sample Configuration File, the produced output would correspond to a morfo output level (i.e. morphological analysis but no PoS tagging). The expected results are:

 El el DA0MS0 1 
 gato gato NCMS000 1 
 come comer VMIP3S0 0.75 comer VMM02S0 0.25 
 pescado pescado NCMS000 0.833333 pescar VMP00SM 0.166667 
 . . Fp 1 

 Pero pero CC 0.99878 pero NCMS000 0.00121951 Pero NP00000 0.00121951 
 a a NCFS000 0.0054008 a SPS00 0.994599 
 Don_Jaime Don_Jaime NP00000 1 
 no no NCMS000 0.00231911 no RN 0.997681 
 le él PP3CSD00 1 
 gustan gustar VMIP3P0 1 
 los el DA0MP0 0.975719 lo NCMP000 0.00019425 él PP3MPA00 0.024087 
 gatos gato NCMP000 1 
 . . Fp 1

If we also wanted PoS tagging, we could have issued the command:

analyze -f myconfig.cfg --outlv tagged <mytext.txt >mytext.tag

to obtain the tagged output:

 El el DA0MS0
 gato gato NCMS000
 come comer VMIP3S0
 pescado pescado NCMS000
 . . Fp

 Pero pero CC
 a a SPS00
 Don_Jaime Don_Jaime NP00000
 no no RN
 le él PP3CSD00
 gustan gustar VMIP3P0
 los el DA0MP0
 gatos gato NCMP000
 . . Fp

We can also ask for the senses of the tagged words:

analyze -f myconfig.cfg --outlv tagged --sense all <mytext.txt >mytext.sen

obtaining the output:

 El el DA0MS0
 gato gato NCMS000 01630731:07221232:01631653
 come comer VMIP3S0 00794578:00793267
 pescado pescado NCMS000 05810856:02006311
 . . Fp

 Pero pero CC
 a a SPS00
 Don_Jaime Don_Jaime NP00000
 no no RN
 le él PP3CSD00
 gustan gustar VMIP3P0 01244897:01213391:01241953
 los el DA0MP0
 gatos gato NCMP000 01630731:07221232:01631653
 . . Fp

Alternatively, if we don't want to repeat the first steps that we had already performed, we could use the output of the morphological analyzer as input to the tagger:

analyze -f myconfig.cfg --inplv morfo --outlv tagged <mytext.mrf >mytext.tag

See options InputLevel, OutputLevel, InputFormat, and OutputFormat in section Valid options for details on the valid input and output levels and formats.

Configuration File and Command Line Options

Almost all options may be specified either in the configuration file or on the command line, with the latter taking precedence over the former.

Valid options are presented in section Valid options, both in their command-line and configuration file notations. Configuration files follow the usual Linux standards. A sample file may be seen in section Sample Configuration File.

The FreeLing package includes default configuration files. They can be found in the directory share/freeling/config under the FreeLing installation prefix (/usr/local if you installed from source, /usr if you used a binary .deb package). The analyze script will try to locate the configuration file in that directory if it is not found in the current working directory.


Valid Options

This section presents the options that can be given to the analyzer program (and thus, also to the analyzer_server program and to the analyze script). All options can be written in the configuration file as well as on the command line. The latter always takes precedence over the former.


Help
Command line Configuration file
-h, --help, --help-cf N/A

Prints to stdout a help screen with valid options and exits.
--help provides information about command line options.
--help-cf provides information about configuration file options.


Version number
Command line Configuration file
-v, --version N/A

Prints the version number of the currently installed FreeLing library.


Configuration file
Command line Configuration file
-f <filename> N/A

Specify the configuration file to use (default: analyzer.cfg in the current directory).


Server mode
Command line Configuration file
--server ServerMode=(yes/y/on/no/n/off)

Activate server mode. Requires that option --port is also provided.
Default value is off.


Server Port Number
Command line Configuration file
-p <int>, --port <int> ServerPort=<int>

Specify port where server will be listening for requests. This option must be specified if server mode is active, and it is ignored if server mode is off.


Maximum Number of Server Workers
Command line Configuration file
-w <int>, --workers <int> ServerMaxWorkers=<int>

Specify maximum number of active workers that the server will launch. Each worker attends a client, so this is the maximum number of clients that are simultaneously attended. This option is ignored if server mode is off.

Default value is 5. Note that a high number of simultaneous workers will result in forking that many processes, which may overload the CPU and memory of your machine, resulting in a system collapse.

When the maximum number of workers is reached, new incoming requests are queued until a worker finishes.


Maximum Size of Server Queue
Command line Configuration file
-q <int>, --queue <int> ServerQueueSize=<int>

Specify maximum number of pending clients that the server socket can hold. This option is ignored if server mode is off.

Pending clients are requests waiting for a worker to be available. They are queued in the operating system socket queue.

Default value is 32. Note that the operating system has an internal limit for the socket queue size (e.g. modern Linux kernels set it to 128). If the given value is higher than the operating system limit, it will be ignored.

When the pending queue is full, new incoming requests get a connection error.


Trace Level
Command line Configuration file
-l <int>, --tlevel <int> TraceLevel=<int>

Set the trace level (0 = no trace, higher values = more trace), for debugging purposes.

This will work only if the library was compiled with tracing information, using ./configure --enable-traces. Note that the code with tracing information is slower than the code compiled without it, even when traces are not active.


Trace Module
Command line Configuration file
-m <mask>, --tmod <mask> TraceModule=<mask>

Specify modules to trace. Each module is identified with a hexadecimal flag. All flags may be OR-ed to specify the set of modules to be traced (an example follows the table below).

Valid masks are defined in file src/include/freeling/morfo/traces.h, and are the following:

Module Mask
Splitter 0x00000001
Tokenizer 0x00000002
Morphological analyzer 0x00000004
Language Identifier 0x00000008
Numbers detection 0x00000010
Date/time detection 0x00000020
Punctuation 0x00000040
Dictionary 0x00000080
Affixes 0x00000100
Multiwords 0x00000200
NE Recognition 0x00000400
Probabilities 0x00000800
Quantities detection 0x00001000
NE Classification 0x00002000
Automat (abstract) 0x00004000
PoS Tagger 0x00008000
Sense annotation 0x00010000
Chart parser 0x00020000
Chart grammar 0x00040000
Dependency parser 0x00080000
Coreference resolution 0x00100000
Basic utilities 0x00200000
WSD 0x00400000
Alternatives 0x00800000
Database access 0x01000000
Feature Extraction 0x02000000
Machine Learning modules 0x04000000
Phonetic encoding 0x08000000
Mention detection 0x10000000
Input/Output 0x20000000
Semantic graph extraction 0x40000000
Summarizer 0x80000000
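
For example, to trace the tokenizer (0x00000002) and the splitter (0x00000001) at the same time, OR both flags into a single mask. This requires a library compiled with --enable-traces; the trace level 4 below is just an arbitrary non-zero value:

analyze -f en.cfg --tlevel 4 --tmod 0x00000003 <myinput >myoutput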

Language of input text
Command line Configuration file
--lang <language> Lang=<language>

Code for language of input text. Though it is not required, the convention is to use two-letter ISO codes (as: Asturian, es: Spanish, ca: Catalan, en: English, cy: Welsh, it: Italian, gl: Galician, pt: Portuguese, ru: Russian, old-es: old Spanish, etc).

Other languages may be added to the library. See chapter Adding Support for New Languages for details.


Locale
Command line Configuration file
--locale <locale> Locale=<locale>

Locale to be used to interpret both input text and data files. Usually, the value will match the locale of the Lang option (e.g. es_ES.utf8 for Spanish, ca_ES.utf8 for Catalan, etc.). The values default (which stands for en_US.utf8) and system (which stands for the currently active system locale) may also be used.


Splitter Buffer Flushing
Command line Configuration file
--flush, --noflush AlwaysFlush=(yes/y/on/no/n/off)

When this option is inactive (the most usual choice) the sentence splitter buffers lines until a sentence marker is found. Then, it outputs a complete sentence.

When this option is active, the splitter never buffers any token, and considers each newline as a sentence end, thus processing each line as an independent sentence.
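
For instance, if the input is already segmented one sentence per line, you can activate this behaviour from the command line:

analyze -f en.cfg --flush <myinput >myoutput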


Input Format
Command line Configuration file
--input <string> InputFormat=<string>

Input format in which to expect text to analyze.

Valid values are:
  • text: Plain text.
  • freeling: pseudo-column format produced by FreeLing with output level morfo or tagged.
  • conll: CoNLL-like column format.


Input CoNLL format definition file
Command line Configuration file
--iconll <filename> InputConllConfig=<filename>

Configuration file for input CoNLL format. Defines which columns must be read, and in which order. See section Input/Output Handling Modules for details on the file format.

This option is valid only when InputFormat=conll. Otherwise, it is ignored.


Output Format
Command line Configuration file
--output <string> OutputFormat=<string>

Output format to produce with analysis results.

Valid values are:

  • freeling: Classical FreeLing format. It may be a pseudo-column format with output levels morfo or tagged, parenthesized trees for parsing output, or other human-readable output for coreference or semantic graph output.
  • conll: CoNLL-like column format.
  • xml: FreeLing-specific XML format.
  • json: JSON format.
  • naf: XML format following NAF conventions (see https://github.com/newsreader/NAF).
  • train: Produce FreeLing pseudo-column format suitable to train PoS taggers. This option can be used to annotate a corpus, correct the output manually, and use it to retrain the taggers with the script src/utilities/train-tagger/bin/TRAIN.sh provided in the FreeLing package. See src/utilities/train-tagger/README for details about how to use it.
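
For example, to obtain the tagged analysis of the usage example above as JSON instead of the default pseudo-column format:

analyze -f myconfig.cfg --outlv tagged --output json <mytext.txt >mytext.json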

Output CoNLL format definition file
Command line Configuration file
--oconll <filename> OutputConllConfig=<filename>

Configuration file for output CoNLL format. Defines which columns must be written, and in which order. See section Input/Output Handling Modules for details on the file format.

This option is valid only when OutputFormat=conll. Otherwise, it is ignored.


Input Level
Command line Configuration file
--inplv <string> InputLevel=<string>

Analysis level of input data (plain, token, splitted, morfo, tagged, shallow, dep, coref).

  • plain: plain text.
  • token: tokenized text (one token per line).
  • splitted: tokenized and sentence-splitted text (one token per line, sentences separated with one blank line).
  • morfo: tokenized, sentence-splitted, and morphologically analyzed text. One token per line, sentences separated with one blank line. Each line has the format: word (lemma tag prob)+
  • tagged: tokenized, sentence-splitted, morphologically analyzed, and PoS-tagged text. One token per line, sentences separated with one blank line. Each line has the format: word lemma tag.
  • shallow: the previous plus constituency parsing. Only valid with InputFormat=conll.
  • dep: the previous plus dependency parsing (may include constituents or not. May include also SRL). Only valid with InputFormat=conll.
  • coref: the previous plus coreference. Only valid with InputFormat=conll.

Output Level
Command line Configuration file
--outlv <string> OutputLevel=<string>

Analysis level of output data (ident, token, splitted, morfo, tagged, shallow, parsed, dep, coref, semgraph).

  • ident: perform language identification instead of analysis.
  • token: tokenized text (one token per line).
  • splitted: tokenized and sentence-splitted text (one token per line, sentences separated with one blank line).
  • morfo: tokenized, sentence-splitted, and morphologically analyzed text. One token per line, sentences separated with one blank line.
  • tagged: tokenized, sentence-splitted, morphologically analyzed, and PoS-tagged text. One token per line, sentences separated with one blank line.
  • shallow: tokenized, sentence-splitted, morphologically analyzed, PoS-tagged, optionally sense-annotated, and shallow-parsed text, produced by the chart_parser module.
  • parsed: tokenized, sentence-splitted, morphologically analyzed, PoS-tagged, optionally sense-annotated, and full-parsed text, as output by the first stage (tree completion) of the rule-based dependency parser.
  • dep: tokenized, sentence-splitted, morphologically analyzed, PoS-tagged, optionally sense-annotated, and dependency-parsed text, as output by the second stage (transformation to dependencies and function labelling) of the dependency parser. May include also SRL if the statistical parser is used (and SRL is available for the input language).
  • coref: the previous plus coreference.
  • semgraph: the previous plus semantic graph. Only valid with OutputFormat=xml|json|freeling.

Language Identification Configuration File
Command line Configuration file
-I <filename>, --fidn <filename> N/A

Configuration file for language identifier.


Tokenizer File
Command line Configuration file
--ftok <filename> TokenizerFile=<filename>

File of tokenization rules.


Splitter File
Command line Configuration file
--fsplit <filename> SplitterFile=<filename>

File of splitter rules.


Affix Analysis
Command line Configuration file
--afx, --noafx AffixAnalysis=(yes/y/on/no/n/off)

Whether to perform affix analysis on unknown words. Affix analysis applies a set of affixation rules to the word to check whether it is a derived form of a known word.


Affixation Rules File
Command line Configuration file
-S <filename>, --fafx <filename> AffixFile=<filename>

Affix rules file, used by dictionary module.


User Map
Command line Configuration file
--usr, --nousr UserMap=(yes/y/on/no/n/off)

Whether or not to apply a file of customized word-tag mappings.


User Map File
Command line Configuration file
-M <filename>, --fmap <filename> UserMapFile=<filename>

User Map file to be used.


Multiword Detection
Command line Configuration file
--loc, --noloc MultiwordsDetection=(yes/y/on/no/n/off)

Whether to perform multiword detection. This option requires that a multiword file is provided.


Multiword File
Command line Configuration file
-L <filename>, --floc <filename> LocutionsFile=<filename>

Multiword definition file.


Number Detection
Command line Configuration file
--numb, --nonumb NumbersDetection=(yes/y/on/no/n/off)

Whether to perform numerical expression detection. Deactivating this feature will affect the behaviour of the date/time and ratio/currency detection modules.


Decimal Point
Command line Configuration file
--dec <string> DecimalPoint=<string>

Specify the decimal point character for the number detection module (for instance, in English it is a dot, but in Spanish it is a comma).


Thousand Point
Command line Configuration file
--thou <string> ThousandPoint=<string>

Specify the thousand point character for the number detection module (for instance, in English it is a comma, but in Spanish it is a dot).


Punctuation Detection
Command line Configuration file
--punt, --nopunt PunctuationDetection=(yes/y/on/no/n/off)

Whether to assign a PoS tag to punctuation signs.


Punctuation Detection File
Command line Configuration file
-F <filename>, --fpunct <filename> PunctuationFile=<filename>

Punctuation symbols file.


Date Detection
Command line Configuration file
--date, --nodate DatesDetection=(yes/y/on/no/n/off)

Whether to perform date and time expression detection.


Quantities Detection
Command line Configuration file
--quant, --noquant QuantitiesDetection=(yes/y/on/no/n/off)

Whether to perform detection of currency amounts, physical magnitudes, and ratios.


Quantity Recognition File
Command line Configuration file
-Q <filename>, --fqty <filename> QuantitiesFile=<filename>

Quantity recognition configuration file.


Dictionary Search
Command line Configuration file
--dict, --nodict DictionarySearch=(yes/y/on/no/n/off)

Whether to search word forms in the dictionary. Deactivating this feature also deactivates the AffixAnalysis option.


Dictionary File
Command line Configuration file
-D <filename>, --fdict <filename> DictionaryFile=<filename>

Dictionary database.


Probability Assignment
Command line Configuration file
--prob, --noprob ProbabilityAssignment=(yes/y/on/no/n/off)

Whether to compute a lexical probability for each tag of each word. Deactivating this feature will affect the behaviour of the PoS tagger.


Lexical Probabilities File
Command line Configuration file
-P <filename>, --fprob <filename> ProbabilityFile=<filename>

Lexical probabilities file. The probabilities in this file are used to compute the most likely tag for a word, as well as to estimate the likely tags for unknown words.


Unknown Words Probability Threshold.
Command line Configuration file
-e <float>, --thres <float> ProbabilityThreshold=<float>

Threshold that must be reached by the probability of a tag given the suffix of an unknown word in order to be included in the list of possible tags for that word. Default is zero (all tags are included in the list). A non-zero value (e.g. 0.0001, 0.001) is recommended.


Named Entity Recognition
Command line Configuration file
--ner, --noner NERecognition=(yes/y/on/no/n/off)

Whether to perform NE recognition.


Named Entity Recognizer File
Command line Configuration file
-N <filename>, --fnp <filename> NPDataFile=<filename>

Configuration data file for NE recognizer.


Named Entity Classification
Command line Configuration file
--nec, --nonec NEClassification=(yes/y/on/no/n/off)

Whether to perform NE classification.


Named Entity Classifier File
Command line Configuration file
--fnec <filename> NECFile=<filename>

Configuration file for the Named Entity Classifier module.


Phonetic Encoding
Command line Configuration file
--phon, --nophon Phonetics=(yes/y/on/no/n/off)

Whether to add phonetic transcription to each word.


Phonetic Encoder File
Command line Configuration file
--fphon <filename> PhoneticsFile=<filename>

Configuration file for the phonetic encoding module.


Sense Annotation
Command line Configuration file
-s <string>, --sense <string> SenseAnnotation=<string>

Kind of sense annotation to perform:

  • no, none: Deactivate sense annotation.
  • all: annotate with all possible senses in sense dictionary.
  • mfs: annotate with most frequent sense.
  • ukb: annotate all senses, ranked by UKB algorithm.

If the annotation is active, the PoS tag selected by the tagger for each word is enriched with a list of all its possible WN synsets. The sense repository used depends on the options "Sense Annotation Configuration File" and "UKB Word Sense Disambiguator Configuration File" described below.
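
For example, to annotate the tagged words with senses ranked by UKB (this relies on the UKB configuration file described below):

analyze -f myconfig.cfg --outlv tagged --sense ukb <mytext.txt >mytext.sen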


Sense Annotation Configuration File
Command line Configuration file
-W <filename>, --fsense <filename> SenseConfigFile=<filename>

Word sense annotator configuration file.


UKB Word Sense Disambiguator Configuration File
Command line Configuration file
-U <filename>, --fukb <filename> UKBConfigFile=<filename>

UKB configuration file.


Tagger algorithm
Command line Configuration file
-t <string>, --tag <string> Tagger=<string>

Algorithm to use for PoS tagging:

  • hmm: Hidden Markov Model tagger, based on [Bra00].
  • relax: Relaxation Labelling tagger, based on [Pad98].
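
For example, to override the hmm tagger selected in the sample configuration file and use the relaxation labelling tagger instead (assuming FREELINGSHARE is defined, as discussed above):

analyze -f myconfig.cfg --tag relax --rlx $FREELINGSHARE/es/constr_gram-B.dat <mytext.txt >mytext.tag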

HMM Tagger configuration File
Command line Configuration file
-H <filename>, --hmm <filename> TaggerHMMFile=<filename>

Parameters file for HMM tagger.


Relaxation labelling tagger constraints file
Command line Configuration file
-R <filename>, --rlx <filename> TaggerRelaxFile=<filename>

File containing the constraints to apply to solve the PoS tagging.


Relaxation labelling tagger iteration limit
Command line Configuration file
-i <int>, --iter <int> TaggerRelaxMaxIter=<int>

Maximum number of iterations to perform in case relaxation does not converge.


Relaxation labelling tagger scale factor
Command line Configuration file
-r <float>, --sf <float> TaggerRelaxScaleFactor=<float>

Scale factor used to normalize supports inside the RL algorithm. It is comparable to the step length in a hill-climbing algorithm: the larger the scale factor, the smaller the step.


Relaxation labelling tagger epsilon value
Command line Configuration file
--eps <float> TaggerRelaxEpsilon=<float>

Real value used to determine when a relaxation labelling iteration has produced no significant changes. The algorithm stops when no weight has changed above the specified epsilon.


Retokenize contractions in dictionary
Command line Configuration file
--rtkcon, --nortkcon RetokContractions=(yes/y/on/no/n/off)

Specifies whether the dictionary must retokenize contractions when found, or leave the decision to the TaggerRetokenize option.

Note that if this option is active, contractions will be retokenized even if the TaggerRetokenize option is not active. If this option is not active, contractions will be retokenized depending on the value of the TaggerRetokenize option.


Retokenize after tagging
Command line Configuration file
--rtk, --nortk TaggerRetokenize=(yes/y/on/no/n/off)

Determine whether the tagger must perform retokenization after the appropriate analysis has been selected for each word. This is closely related to affix analysis and PoS taggers.


Force the selection of one unique tag
Command line Configuration file
--force <string> TaggerForceSelect=(none/tagger/retok)

Determine whether the tagger must be forced to (probably randomly) make a unique choice and when.

  • none: Do not force the tagger, allow ambiguous output.
  • tagger: Force the tagger to choose before retokenization (i.e. if retokenization introduces any ambiguity, it will be present in the final output).
  • retok: Force the tagger to choose after retokenization (no remaining ambiguity).

Chart Parser Grammar File
Command line Configuration file
-G <filename>, --grammar <filename> GrammarFile=<filename>

This file contains a CFG grammar for the chart parser, and some directives to control which chart edges are selected to build the final tree.


Dependency Parser Rule File
Command line Configuration file
-T <filename>, --txala <filename> DepTxalaFile=<filename>

Rules to be used to perform rule-based dependency analysis.


Statistical Dependency Parser File
Command line Configuration file
-E <filename>, --treeler <filename> DepTreelerFile=<filename>

Configuration file for the statistical dependency parser and SRL module.


Dependency Parser Selection
Command line Configuration file
-d <string>, --dep <string> DependencyParser=<string>

Which dependency parser to use. Valid values are:

  • txala: rule-based parser.
  • treeler: statistical parser (may also perform SRL).
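
For example, to obtain dependency trees from the statistical parser:

analyze -f myconfig.cfg --outlv dep --dep treeler <mytext.txt >mytext.dep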


Coreference Resolution File
Command line Configuration file
-C <filename>, --fcorf <filename> CorefFile=<filename>

Configuration file for coreference resolution module.

Sample Configuration File

A sample configuration file follows. You can start using FreeLing with the default configuration files, which are installed at /usr/local/share/freeling/config (the /usr/local prefix may differ if you specified an alternative location when installing FreeLing). If you installed from a binary .deb package, they will be at /usr/share/freeling/config.

You can use those files as a starting point to customize one configuration file to suit your needs.

Note that file paths in the sample configuration file contain $FREELINGSHARE, which is supposed to be an environment variable. If this variable is not defined, the analyzer will abort, complaining about not finding the files.

If you use the analyze script, it will define the variable for you as /usr/local/share/freeling (or the right installation path), unless you define it to point somewhere else.

You can also adjust your configuration files to use normal paths for the files (either relative or absolute) instead of using variables.
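
For example, a one-line substitution over a default configuration file does the job (a sketch; adjust the source file and the path to your installation):

sed 's|\$FREELINGSHARE|/usr/local/share/freeling|g' es.cfg > myconfig.cfg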

##
#### default configuration file for Spanish analyzer
##

#### General options 
Lang=es
Locale=default

### Tagset description file, used by different modules
TagsetFile=$FREELINGSHARE/es/tagset.dat

## Traces (deactivated)
TraceLevel=0
TraceModule=0x0000

## Options to control the applied modules. The input may be partially
## processed, or a full analysis may not be wanted. The specific
## formats are a choice of the main program using the library, as well
## as the responsibility of calling only the required modules.
InputLevel=text
OutputLevel=morfo

# Do not consider each newline as a sentence end
AlwaysFlush=no

#### Tokenizer options
TokenizerFile=$FREELINGSHARE/es/tokenizer.dat

#### Splitter options
SplitterFile=$FREELINGSHARE/es/splitter.dat

#### Morfo options
AffixAnalysis=yes
CompoundAnalysis=yes
MultiwordsDetection=yes
NumbersDetection=yes
PunctuationDetection=yes
DatesDetection=yes
QuantitiesDetection=yes
DictionarySearch=yes
ProbabilityAssignment=yes
DecimalPoint=,
ThousandPoint=.
LocutionsFile=$FREELINGSHARE/es/locucions.dat 
QuantitiesFile=$FREELINGSHARE/es/quantities.dat
AffixFile=$FREELINGSHARE/es/afixos.dat
CompoundFile=$FREELINGSHARE/es/compounds.dat
ProbabilityFile=$FREELINGSHARE/es/probabilitats.dat
DictionaryFile=$FREELINGSHARE/es/dicc.src
PunctuationFile=$FREELINGSHARE/common/punct.dat
ProbabilityThreshold=0.001

# NER options 
NERecognition=yes
NPDataFile=$FREELINGSHARE/es/np.dat
## comment the line above and uncomment one of those below if you want
## a better NE recognizer (higher accuracy, lower speed)
#NPDataFile=$FREELINGSHARE/es/nerc/ner/ner-ab-poor1.dat
#NPDataFile=$FREELINGSHARE/es/nerc/ner/ner-ab-rich.dat
# "rich" model is trained with rich gazetteer. Offers higher accuracy but 
# requires adapting gazetteer files to have high coverage on target corpus.
# "poor1" model is trained with poor gazetteer. Accuracy is splightly lower
# but suffers small accuracy loss the gazetteer has low coverage in target corpus.
# If in doubt, use "poor1" model.

## Phonetic encoding of words.
Phonetics=no
PhoneticsFile=$FREELINGSHARE/es/phonetics.dat

## NEC options. See README in common/nec
NEClassification=no
NECFile=$FREELINGSHARE/es/nerc/nec/nec-ab-poor1.dat
#NECFile=$FREELINGSHARE/es/nerc/nec/nec-ab-rich.dat

## Sense annotation options (none,all,mfs,ukb)
SenseAnnotation=none
SenseConfigFile=$FREELINGSHARE/es/senses.dat
UKBConfigFile=$FREELINGSHARE/es/ukb.dat

#### Tagger options
Tagger=hmm
TaggerHMMFile=$FREELINGSHARE/es/tagger.dat
TaggerRelaxFile=$FREELINGSHARE/es/constr_gram-B.dat
TaggerRelaxMaxIter=500
TaggerRelaxScaleFactor=670.0
TaggerRelaxEpsilon=0.001
TaggerRetokenize=yes
TaggerForceSelect=tagger

#### Parser options
GrammarFile=$FREELINGSHARE/es/chunker/grammar-chunk.dat

#### Dependency Parser options
DependencyParser=txala
DepTxalaFile=$FREELINGSHARE/es/dep_txala/dependences.dat
DepTreelerFile=$FREELINGSHARE/es/dep_treeler/dependences.dat

#### Coreference Solver options
CorefFile=$FREELINGSHARE/es/coref/relaxcor/relaxcor.dat
SemGraphExtractorFile=$FREELINGSHARE/es/semgraph/semgraph-SRL.dat