DeeAnalyzer

DeeAnalyzer — Primary gateway for data indexing

Functions

Types and Values

struct DeeAnalyzer

Object Hierarchy

    GObject
    ╰── DeeAnalyzer
        ╰── DeeTextAnalyzer

Includes

#include <dee.h>

Description

A DeeAnalyzer takes a text stream, splits it into tokens, and runs the tokens through a series of filtering steps. It can optionally output collation keys for the terms.

One of the important use cases of analyzers in Dee is as a vessel for the indexing logic used to create a DeeIndex from a DeeModel.

The recommended way to implement a custom analyzer is to add term filters to a DeeAnalyzer or DeeTextAnalyzer instance with dee_analyzer_add_term_filter(), to derive your own subclass that overrides the dee_analyzer_tokenize() method, or both. If you have very special requirements, it is also possible to reimplement all aspects of the analyzer class.

Functions

DeeCollatorFunc ()

gchar *
(*DeeCollatorFunc) (const gchar *input,
                    gpointer data);

A collator takes an input string, most often a term produced from a DeeAnalyzer, and outputs a collation key.

Parameters

input

The string to produce a collation key for

 

data

User data set when registering the collator.

[closure]

Returns

The collation key. Free with g_free() when done using it.

[transfer full]
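As an illustration of the collator contract described above, here is a minimal sketch written in plain C rather than against GLib. It uses `char *`, `void *`, and `malloc()` as stand-ins for `gchar *`, `gpointer`, and GLib allocation so that it compiles on its own. The name `ascii_lowercase_collator` is hypothetical, and the ASCII-only case folding is an assumption made to keep the example short.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical collator with the same shape as DeeCollatorFunc.
 * Plain C types stand in for gchar and gpointer (assumption). */
char *
ascii_lowercase_collator (const char *input, void *data)
{
  size_t len = strlen (input);
  char *key = malloc (len + 1);
  size_t i;

  (void) data; /* this sketch needs no user data */

  if (key == NULL)
    return NULL;

  /* Fold ASCII letters to lowercase so "Apple" and "apple" share a key. */
  for (i = 0; i < len; i++)
    key[i] = (char) tolower ((unsigned char) input[i]);
  key[len] = '\0';

  return key; /* ownership passes to the caller */
}
```

A real collator would allocate the key with GLib so that the caller can release it with g_free(), as the return value documentation requires.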


DeeTermFilterFunc ()

void
(*DeeTermFilterFunc) (DeeTermList *terms_in,
                      DeeTermList *terms_out,
                      gpointer filter_data);

A term filter takes a list of terms, runs it through a filter and/or a set of transformations, and stores the output in a DeeTermList.

You can register term filters on a DeeAnalyzer with dee_analyzer_add_term_filter().

Parameters

terms_in

A DeeTermList with the terms to filter

 

terms_out

A DeeTermList to write the filtered terms to

 

filter_data

User data set when registering the filter.

[closure]

Returns

Nothing. Output is stored in terms_out.
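The filter contract (read from terms_in, write to terms_out, return nothing) can be sketched without GLib as follows. `BorrowedTermList` is a hypothetical stand-in for DeeTermList, `min_length_filter` is a hypothetical filter name, and the convention that filter_data points to a `size_t` is an assumption of this example only.

```c
#include <stddef.h>
#include <string.h>

#define BORROWED_MAX_TERMS 32

/* Stand-in for DeeTermList (assumption): a fixed-capacity array
 * of borrowed string pointers. */
typedef struct {
  const char *terms[BORROWED_MAX_TERMS];
  size_t      n_terms;
} BorrowedTermList;

/* Hypothetical filter with the same shape as DeeTermFilterFunc.
 * filter_data points to a size_t holding the minimum term length. */
void
min_length_filter (BorrowedTermList *terms_in,
                   BorrowedTermList *terms_out,
                   void             *filter_data)
{
  size_t min_len = *(const size_t *) filter_data;
  size_t i;

  for (i = 0; i < terms_in->n_terms; i++)
    {
      const char *term = terms_in->terms[i];

      /* Copy only the terms that are long enough; drop the rest. */
      if (strlen (term) >= min_len
          && terms_out->n_terms < BORROWED_MAX_TERMS)
        terms_out->terms[terms_out->n_terms++] = term;
    }
}
```

The filter never modifies terms_in. Removing a term simply means not writing it to terms_out, and a transforming filter would write a modified term instead.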


dee_analyzer_analyze ()

void
dee_analyzer_analyze (DeeAnalyzer *self,
                      const gchar *data,
                      DeeTermList *terms_out,
                      DeeTermList *colkeys_out);

Extract terms and/or collation keys from some input data (which is normally, but not necessarily, a UTF-8 string).

The terms and corresponding collation keys will be written in order to the provided DeeTermLists.

Implementation notes for subclasses: The analysis process must call dee_analyzer_tokenize() and run the tokens through all term filters added with dee_analyzer_add_term_filter(). Collation keys must be generated with dee_analyzer_collate_key().

Parameters

self

The analyzer to use

 

data

The input data to analyze

 

terms_out

A DeeTermList to place the generated terms in. If NULL, no terms are generated.

[allow-none]

colkeys_out

A DeeTermList to place generated collation keys in. If NULL no collation keys are generated.

[allow-none]
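The order of operations in the implementation notes (tokenize, run the term filters, then generate collation keys) can be sketched in plain C as below. This is not the library's implementation. `OwnedTermList`, `sketch_analyze`, and the helper functions are hypothetical names; splitting on spaces, using a single lowercasing filter, and copying the term as its own collation key (mirroring the documented g_strdup() default) are all assumptions of the sketch.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define OWNED_MAX_TERMS 16
#define OWNED_MAX_LEN   32

/* Stand-in for DeeTermList (assumption): owns fixed-size copies. */
typedef struct {
  char   terms[OWNED_MAX_TERMS][OWNED_MAX_LEN];
  size_t n_terms;
} OwnedTermList;

static void
term_list_add (OwnedTermList *list, const char *term, size_t len)
{
  if (list->n_terms == OWNED_MAX_TERMS)
    return;                       /* list full: drop the term */
  if (len >= OWNED_MAX_LEN)
    len = OWNED_MAX_LEN - 1;      /* truncate over-long terms */
  memcpy (list->terms[list->n_terms], term, len);
  list->terms[list->n_terms][len] = '\0';
  list->n_terms++;
}

/* Step 1: tokenize by splitting on spaces. */
static void
tokenize (const char *data, OwnedTermList *tokens_out)
{
  const char *p = data;

  while (*p != '\0')
    {
      const char *start;

      while (*p == ' ')
        p++;
      start = p;
      while (*p != '\0' && *p != ' ')
        p++;
      if (p > start)
        term_list_add (tokens_out, start, (size_t) (p - start));
    }
}

/* Step 2: one example term filter that lowercases ASCII letters. */
static void
lowercase_filter (const OwnedTermList *in, OwnedTermList *out)
{
  size_t i, j;

  for (i = 0; i < in->n_terms; i++)
    {
      char lowered[OWNED_MAX_LEN];
      size_t len = strlen (in->terms[i]);

      for (j = 0; j < len; j++)
        lowered[j] = (char) tolower ((unsigned char) in->terms[i][j]);
      term_list_add (out, lowered, len);
    }
}

/* Step 3: emit terms and identity collation keys, in order.
 * Either output list may be NULL, in which case it is skipped. */
void
sketch_analyze (const char *data,
                OwnedTermList *terms_out,
                OwnedTermList *colkeys_out)
{
  OwnedTermList tokens, filtered;
  size_t i;

  tokens.n_terms = 0;
  filtered.n_terms = 0;

  tokenize (data, &tokens);
  lowercase_filter (&tokens, &filtered);

  for (i = 0; i < filtered.n_terms; i++)
    {
      const char *term = filtered.terms[i];

      if (terms_out != NULL)
        term_list_add (terms_out, term, strlen (term));
      if (colkeys_out != NULL)
        term_list_add (colkeys_out, term, strlen (term));
    }
}
```

Because terms and keys are appended in the same loop, the entry at a given position in colkeys_out always corresponds to the term at the same position in terms_out.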

dee_analyzer_tokenize ()

void
dee_analyzer_tokenize (DeeAnalyzer *self,
                       const gchar *data,
                       DeeTermList *terms_out);

Tokenize some input data (which is normally, but not necessarily, a UTF-8 string).

Tokenization splits the input data into constituents (in most cases words), but does not run them through any of the term filters set for the analyzer. It is undefined whether the tokenization process itself does any normalization.

Parameters

self

The analyzer to use

 

data

The input data to analyze

 

terms_out

A DeeTermList to place the generated tokens in.

 

dee_analyzer_add_term_filter ()

void
dee_analyzer_add_term_filter (DeeAnalyzer *self,
                              DeeTermFilterFunc filter_func,
                              gpointer filter_data,
                              GDestroyNotify filter_destroy);

Register a DeeTermFilterFunc to be called whenever dee_analyzer_analyze() is called.

Term filters can be used to normalize, add, or remove terms from an input data stream.

Parameters

self

The analyzer to add a term filter to

 

filter_func

Function to call.

[scope notified]

filter_data

Data to pass to filter_func when it is invoked.

[closure]

filter_destroy

Called on filter_data when the DeeAnalyzer owning the filter is destroyed.

[allow-none]

dee_analyzer_collate_key ()

gchar *
dee_analyzer_collate_key (DeeAnalyzer *self,
                          const gchar *data);

Generate a collation key for a set of input data (usually a UTF-8 string passed through tokenization and term filters of the analyzer).

The default implementation just calls g_strdup().

Parameters

self

The analyzer to generate a collation key with

 

data

The input data to generate a collation key for

 

Returns

A newly allocated collation key. Use dee_analyzer_collate_cmp() or dee_analyzer_collate_cmp_func() to compare collation keys. Free with g_free().


dee_analyzer_collate_cmp ()

gint
dee_analyzer_collate_cmp (DeeAnalyzer *self,
                          const gchar *key1,
                          const gchar *key2);

Compare collation keys generated by dee_analyzer_collate_key(), with semantics similar to strcmp(). See also dee_analyzer_collate_cmp_func() if you need a version of this function that works as a GCompareDataFunc.

The default implementation in DeeAnalyzer just uses strcmp().

Parameters

self

The analyzer to use when comparing collation keys

 

key1

The first collation key to compare

 

key2

The second collation key to compare

 

Returns

-1, 0 or 1 if key1 is less than, equal to, or greater than key2, respectively.


dee_analyzer_collate_cmp_func ()

gint
dee_analyzer_collate_cmp_func (const gchar *key1,
                               const gchar *key2,
                               gpointer analyzer);

A GCompareDataFunc using a DeeAnalyzer to compare the keys. This is just a convenience wrapper around dee_analyzer_collate_cmp().

Parameters

key1

The first key to compare

 

key2

The second key to compare

 

analyzer

The DeeAnalyzer to use for the comparison

 

Returns

-1, 0 or 1 if key1 is less than, equal to, or greater than key2, respectively.
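The wrapper pattern behind this function is an argument reordering: the keys come first and the analyzer arrives last as user data, which is the shape a comparison callback with user data expects. The sketch below shows that pattern in plain C. `SketchAnalyzer`, `sketch_collate_cmp_func`, and `sort_keys` are hypothetical names; the callback-carrying struct is a stand-in for DeeAnalyzer, and the insertion sort is a stand-in for any container that accepts such a callback.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for DeeAnalyzer (assumption): only a compare callback. */
typedef struct SketchAnalyzer SketchAnalyzer;
struct SketchAnalyzer {
  int (*collate_cmp) (SketchAnalyzer *self,
                      const char *key1, const char *key2);
};

/* Mirrors the documented default: plain strcmp() on the keys. */
static int
default_collate_cmp (SketchAnalyzer *self,
                     const char *key1, const char *key2)
{
  (void) self;
  return strcmp (key1, key2);
}

/* Keys first, analyzer last as user data: forwards to collate_cmp. */
int
sketch_collate_cmp_func (const void *key1, const void *key2,
                         void *analyzer)
{
  SketchAnalyzer *self = analyzer;

  return self->collate_cmp (self, key1, key2);
}

/* Insertion sort that takes a comparator plus user data. */
void
sort_keys (const char **keys, size_t n,
           int (*cmp) (const void *, const void *, void *),
           void *user_data)
{
  size_t i, j;

  for (i = 1; i < n; i++)
    {
      const char *key = keys[i];

      for (j = i; j > 0 && cmp (keys[j - 1], key, user_data) > 0; j--)
        keys[j] = keys[j - 1];
      keys[j] = key;
    }
}
```

Passing the analyzer as user data lets a subclass with a custom comparison change the sort order without any change to the sorting code.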


dee_analyzer_new ()

DeeAnalyzer *
dee_analyzer_new (void);

Types and Values

struct DeeAnalyzer

struct DeeAnalyzer;

All fields in the DeeAnalyzer structure are private and should never be accessed directly.