auditok.core.StreamTokenizer

class auditok.core.StreamTokenizer(validator, min_length, max_length, max_continuous_silence, init_min=0, init_max_silence=0, mode=0)[source]

Class for stream tokenizers. It implements a 4-state automaton scheme to extract sub-sequences of interest on the fly.

Parameters:
  • validator (callable, DataValidator (must implement is_valid)) – called with each data frame read from the source. It should take one positional argument and return True for a valid frame and False for an invalid one.
  • min_length (int) – Minimum number of frames of a valid token. This includes all tolerated non-valid frames within the token.
  • max_length (int) – Maximum number of frames of a valid token. This includes all tolerated non-valid frames within the token.
  • max_continuous_silence (int) – Maximum number of consecutive non-valid frames within a token. Note that, within a valid token, there may be many tolerated silent regions, each containing up to max_continuous_silence non-valid frames.
  • init_min (int) – Minimum number of consecutive valid frames that must be initially gathered before any sequence of non-valid frames can be tolerated. This option is not always needed; it can be used to drop non-valid tokens as early as possible. Default = 0 (the option has no effect). See the example after this parameter list.
  • init_max_silence (int) – Maximum number of tolerated consecutive non-valid frames while the number of already gathered valid frames has not yet reached init_min. This argument is normally used together with init_min. Default = 0 (not taken into consideration).
  • mode (int) –

    mode can be one of the following:

    1. StreamTokenizer.NORMAL: do not drop trailing silence, and accept a token shorter than min_length if it is the continuation of the latest delivered token.

    2. StreamTokenizer.STRICT_MIN_LENGTH: if token i is delivered because max_length is reached, and token i+1 is immediately adjacent to token i (i.e. token i ends at frame k and token i+1 starts at frame k+1), then accept token i+1 only if it has a size of at least min_length. The default behavior is to accept token i+1 even if it is shorter than min_length (provided that the above conditions are fulfilled).

    3. StreamTokenizer.DROP_TRAILING_SILENCE: drop all trailing non-valid frames from a token to be delivered if and only if it is not truncated. This can be a bit tricky. A token is actually delivered if:

    • max_continuous_silence is reached.
    • Its length reaches max_length. This is referred to as a truncated token.

    In the current implementation, a StreamTokenizer’s decision is only based on already seen data and on incoming data. Thus, if a token is truncated at a non-valid but tolerated frame (max_length is reached but max_continuous_silence is not yet reached), any trailing silence will be kept because it can potentially be part of a valid token (had max_length been larger). But if max_continuous_silence is reached before max_length, the delivered token will not be considered truncated but rather the result of a normal end of detection (i.e. no more valid data). In that case the trailing silence can be removed if you use the StreamTokenizer.DROP_TRAILING_SILENCE mode.

    4. (StreamTokenizer.STRICT_MIN_LENGTH | StreamTokenizer.DROP_TRAILING_SILENCE): use both options. That means: first remove trailing silence, then check if the token still has a length of at least min_length (a combined-mode example is given at the end of the Examples section below).
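
    The following sketch illustrates init_min and init_max_silence. The input string, parameter values, and expected output are illustrative only, assuming the tokenization semantics described above; exact results may vary across versions:

    >>> from auditok.core import StreamTokenizer
    >>> from auditok.util import StringDataSource, DataValidator
    >>> class UpperCaseChecker(DataValidator):
    ...     def is_valid(self, frame):
    ...         return frame.isupper()
    >>> tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
    ...                             min_length=3,
    ...                             max_length=8,
    ...                             max_continuous_silence=2,
    ...                             init_min=3,
    ...                             init_max_silence=1)
    >>> dsource = StringDataSource("aAbbAAAAbbb")
    >>> tokenizer.tokenize(dsource)
    [(['A', 'A', 'A', 'A', 'b', 'b'], 4, 9)]

    The isolated ‘A’ at frame 1 is dropped early: only one valid frame has been gathered when two consecutive non-valid frames arrive, which exceeds init_max_silence (1) before init_min (3) valid frames are reached.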

Examples

In the following code, without STRICT_MIN_LENGTH, the ‘BB’ token is accepted although it is shorter than min_length (3), because it immediately follows the latest delivered token:

>>> from auditok.core import StreamTokenizer
>>> from auditok.util import StringDataSource, DataValidator
>>> class UpperCaseChecker(DataValidator):
...     def is_valid(self, frame):
...         return frame.isupper()
>>> dsource = StringDataSource("aaaAAAABBbbb")
>>> tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
...                             min_length=3,
...                             max_length=4,
...                             max_continuous_silence=0)
>>> tokenizer.tokenize(dsource)
[(['A', 'A', 'A', 'A'], 3, 6), (['B', 'B'], 7, 8)]

The following tokenizer will however reject the ‘BB’ token:

>>> dsource = StringDataSource("aaaAAAABBbbb")
>>> tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
...                             min_length=3, max_length=4,
...                             max_continuous_silence=0,
...                             mode=StreamTokenizer.STRICT_MIN_LENGTH)
>>> tokenizer.tokenize(dsource)
[(['A', 'A', 'A', 'A'], 3, 6)]
>>> tokenizer = StreamTokenizer(
...     validator=UpperCaseChecker(),
...     min_length=3,
...     max_length=6,
...     max_continuous_silence=3,
...     mode=StreamTokenizer.DROP_TRAILING_SILENCE
... )
>>> dsource = StringDataSource("aaaAAAaaaBBbbbb")
>>> tokenizer.tokenize(dsource)
[(['A', 'A', 'A', 'a', 'a', 'a'], 3, 8), (['B', 'B'], 9, 10)]

The first token is delivered with its trailing silence because it is truncated, while the second one has its trailing frames removed.

Without StreamTokenizer.DROP_TRAILING_SILENCE the output would be:

[
    (['A', 'A', 'A', 'a', 'a', 'a'], 3, 8),
    (['B', 'B', 'b', 'b', 'b'], 9, 13)
]
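
With both flags set, the trimmed ‘BB’ token from the previous example would additionally be rejected: after its trailing silence is removed, it is shorter than min_length. The following is a sketch of the expected behavior under the same parameters as above, assuming the semantics described in the mode list:

>>> tokenizer = StreamTokenizer(
...     validator=UpperCaseChecker(),
...     min_length=3,
...     max_length=6,
...     max_continuous_silence=3,
...     mode=StreamTokenizer.STRICT_MIN_LENGTH | StreamTokenizer.DROP_TRAILING_SILENCE
... )
>>> dsource = StringDataSource("aaaAAAaaaBBbbbb")
>>> tokenizer.tokenize(dsource)
[(['A', 'A', 'A', 'a', 'a', 'a'], 3, 8)]

The first token is kept because it is truncated (its trailing silence is not removed); the would-be second token shrinks to two frames after trimming and is dropped because of STRICT_MIN_LENGTH.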
__init__(validator, min_length, max_length, max_continuous_silence, init_min=0, init_max_silence=0, mode=0)[source]

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(validator, min_length, max_length, …) Initialize self.
tokenize(data_source[, callback, generator]) Read data from data_source, one frame at a time, and process the read frames in order to detect sequences of frames that make up valid tokens.
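
For instance, a callback that takes three positional arguments (the token’s data, start frame, and end frame) can be passed to deliver tokens as they are detected instead of collecting them in a list. A sketch, reusing the validator and data from the first example (the exact output format is illustrative):

>>> def print_token(data, start, end):
...     print("token = '{}', start = {}, end = {}".format("".join(data), start, end))
>>> dsource = StringDataSource("aaaAAAABBbbb")
>>> tokenizer = StreamTokenizer(validator=UpperCaseChecker(), min_length=3,
...                             max_length=4, max_continuous_silence=0)
>>> tokenizer.tokenize(dsource, callback=print_token)
token = 'AAAA', start = 3, end = 6
token = 'BB', start = 7, end = 8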

Attributes

DROP_TRAILING_SILENCE
NOISE
NORMAL
POSSIBLE_NOISE
POSSIBLE_SILENCE
SILENCE
STRICT_MIN_LENGTH