Core

load(input[, skip, max_read]) Load audio data from a source and return it as an AudioRegion.
split(input[, min_dur, max_dur, …]) Split audio data and return a generator of AudioRegions
AudioRegion(data, sampling_rate, …[, meta]) AudioRegion encapsulates raw audio data and provides an interface to perform simple operations on it.
StreamTokenizer(validator, min_length, …) Class for stream tokenizers.
auditok.core.load(input, skip=0, max_read=None, **kwargs)[source]

Load audio data from a source and return it as an AudioRegion.

Parameters:
  • input (None, str, bytes, AudioSource) – source to read audio data from. If str, it should be a path to a valid audio file. If bytes, it is used as raw audio data. If it is "-", raw data will be read from stdin. If None, read audio data from the microphone using PyAudio. If of type bytes, or a path to a raw audio file, then the sampling_rate, sample_width and channels parameters (or their aliases) are required. If it is an AudioSource object, it is used directly to read data.
  • skip (float, default: 0) – amount, in seconds, of audio data to skip from source. If read from a microphone, skip must be 0, otherwise a ValueError is raised.
  • max_read (float, default: None) – amount, in seconds, of audio data to read from source. If read from microphone, max_read should not be None, otherwise a ValueError is raised.
  • audio_format, fmt (str) – type of audio data (e.g., wav, ogg, flac, raw, etc.). This will only be used if input is a string path to an audio file. If not given, audio type will be guessed from file name extension or from file header.
  • sampling_rate, sr (int) – sampling rate of audio data. Required if input is a raw audio file, a bytes object or None (i.e., read from microphone).
  • sample_width, sw (int) – number of bytes used to encode one audio sample, typically 1, 2 or 4. Required for raw data, see sampling_rate.
  • channels, ch (int) – number of channels of audio data. Required for raw data, see sampling_rate.
  • large_file (bool, default: False) – If True, and if input is a path to a wav or a raw audio file (and only these two formats), then the audio file is not fully loaded to memory to create the region (only the portion of data needed to create the region is loaded to memory). Set to True if max_read is significantly smaller than the size of a large audio file that shouldn't be entirely loaded to memory.
Returns:

region

Return type:

AudioRegion

Raises:

ValueError – raised if input is None (i.e., read data from the microphone) and skip != 0, or if input is None and max_read is None (meaning that when reading from the microphone, no data should be skipped and the maximum amount of data to read should be explicitly provided).
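
For illustration, a minimal usage sketch (file names are hypothetical; load is importable from the package root):

>>> import auditok
>>> region = auditok.load("audio.wav")  # format guessed from the extension
>>> # Raw (headerless) data requires explicit audio parameters:
>>> region = auditok.load("audio.raw", skip=2, max_read=5,
...                       sr=16000, sw=2, ch=1)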

auditok.core.split(input, min_dur=0.2, max_dur=5, max_silence=0.3, drop_trailing_silence=False, strict_min_dur=False, **kwargs)[source]

Split audio data and return a generator of AudioRegions

Parameters:
  • input (str, bytes, AudioSource, AudioReader, AudioRegion or None) – input audio data. If str, it should be a path to an existing audio file. “-” is interpreted as standard input. If bytes, input is treated as raw audio data. If None, read audio from the microphone. Every object that is not an AudioReader is transformed into an AudioReader before processing. If input is a str that refers to a raw audio file, a bytes object or None, audio parameters should be provided using kwargs (i.e., sampling_rate, sample_width and channels, or their aliases). If input is a str, the audio format is guessed from the file extension. The audio_format (alias fmt) kwarg can also be given to specify the audio format explicitly. If none of these options is available, rely on the backend (currently only pydub is supported) to load data.
  • min_dur (float, default: 0.2) – minimum duration in seconds of a detected audio event. Using large values for min_dur may cause very short audio events (e.g., very short 1-word utterances like ‘yes’ or ‘no’) to be missed. Using very short values might result in a high number of short, unhelpful audio events.
  • max_dur (float, default: 5) – maximum duration in seconds of a detected audio event. If an audio event lasts more than max_dur it will be truncated. If the continuation of a truncated audio event is shorter than min_dur then this continuation is accepted as a valid audio event if strict_min_dur is False. Otherwise it is rejected.
  • max_silence (float, default: 0.3) – maximum duration of continuous silence within an audio event. There might be many silent gaps of this duration within one audio event. If the continuous silence happens at the end of the event, then it is kept as part of the event if drop_trailing_silence is False (default).
  • drop_trailing_silence (bool, default: False) – whether to remove trailing silence from detected events. To avoid abrupt cuts in speech, trailing silence should be kept; this parameter should therefore normally be False.
  • strict_min_dur (bool, default: False) – strict minimum duration. Do not accept an audio event if it is shorter than min_dur, even if it is contiguous to the latest valid event (which happens when the latest detected event reached max_dur).
Other Parameters:
 
  • analysis_window, aw (float, default: 0.05 (50 ms)) – duration of analysis window in seconds. A value between 0.01 (10 ms) and 0.1 (100 ms) should be good for most use-cases.
  • audio_format, fmt (str) – type of audio data (e.g., wav, ogg, flac, raw, etc.). This will only be used if input is a string path to an audio file. If not given, audio type will be guessed from file name extension or from file header.
  • sampling_rate, sr (int) – sampling rate of audio data. Required if input is a raw audio file, is a bytes object or None (i.e., read from microphone).
  • sample_width, sw (int) – number of bytes used to encode one audio sample, typically 1, 2 or 4. Required for raw data, see sampling_rate.
  • channels, ch (int) – number of channels of audio data. Required for raw data, see sampling_rate.
  • use_channel, uc ({None, “mix”} or int) – which channel to use for split if input has multiple audio channels. Regardless of which channel is used for splitting, returned audio events contain data from all channels, just as input. The following values are accepted:
    • None (alias “any”): accept audio activity from any channel, even if other channels are silent. This is the default behavior.
    • “mix” (“avg” or “average”): mix down all channels (i.e. compute average channel) and split the resulting channel.
    • int (0 <= value < channels): use the channel specified by this integer id for splitting.
  • large_file (bool, default: False) – If True, and if input is a path to a wav or a raw audio file (and only these two formats), then audio data is lazily loaded to memory (i.e., one analysis window at a time). Otherwise the whole file is loaded to memory before splitting. Set to True if the size of the file is larger than available memory.
  • max_read, mr (float, default: None, read until end of stream) – maximum data to read from source in seconds.
  • validator, val (callable, DataValidator) – custom data validator. If None (default), an AudioEnergyValidator is used with the given energy threshold. Can be a callable or an instance of DataValidator that implements is_valid. In either case, it will be called with a window of audio data as the first parameter.
  • energy_threshold, eth (float, default: 50) – energy threshold for audio activity detection. Audio regions that have enough windows with a signal energy equal to or above this threshold are considered valid audio events. Here we refer to this quantity as the energy of the signal, but to be more accurate it is the log energy, computed as: 20 * log10(sqrt(dot(x, x) / len(x))) (see AudioEnergyValidator and calculate_energy_single_channel()). If validator is given, this argument is ignored.
Yields:

AudioRegion – a generator of detected AudioRegion s.
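
For illustration, a sketch of a typical call (the file name and parameter values are arbitrary; start and end metadata on yielded regions are documented under AudioRegion below):

>>> import auditok
>>> for i, region in enumerate(auditok.split("audio.wav",
...                                          min_dur=0.3,
...                                          max_dur=4,
...                                          max_silence=0.2,
...                                          energy_threshold=55)):
...     print(f"event {i}: {region.meta.start:.3f}s -> {region.meta.end:.3f}s")
...     region.save(f"event_{i}.wav")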

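The log-energy formula used by energy_threshold can be checked directly with NumPy; this is an illustrative computation, not the library's internal code:

>>> import numpy as np
>>> x = np.array([10.0, -12.0, 8.0, -9.0])  # one analysis window of samples
>>> energy = 20 * np.log10(np.sqrt(np.dot(x, x) / len(x)))
>>> print(f"{energy:.2f}")  # compared against energy_threshold (default: 50)
19.88
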
class auditok.core.AudioRegion(data, sampling_rate, sample_width, channels, meta=None)[source]

AudioRegion encapsulates raw audio data and provides an interface to perform simple operations on it. Use AudioRegion.load to build an AudioRegion from different types of objects.

Parameters:
  • data (bytes) – raw audio data as a bytes object
  • sampling_rate (int) – sampling rate of audio data
  • sample_width (int) – number of bytes of one audio sample
  • channels (int) – number of channels of audio data
  • meta (dict, default: None) – any collection of <key:value> elements used to build metadata for this AudioRegion. Metadata can be accessed via region.meta.key if key is a valid python attribute name, or via region.meta[key] if not. Note that the split() function (or the AudioRegion.split() method) returns AudioRegions with start and end meta values that indicate the location in seconds of the region in the original audio data.

See also

AudioRegion.load

ch

Number of channels of audio data, alias for channels.

channels

Number of channels of audio data.

duration

Returns region duration in seconds.

len

Return region length in number of samples.

classmethod load(input, skip=0, max_read=None, **kwargs)[source]

Create an AudioRegion by loading data from input. See load() for parameters description.

Returns:

region

Return type:

AudioRegion

Raises:

ValueError – raised if input is None and skip != 0, or if input is None and max_read is None.
millis

A view to slice audio region by milliseconds (using region.millis[start:end]).
play(progress_bar=False, player=None, **progress_bar_kwargs)[source]

Play audio region.

Parameters:
  • progress_bar (bool, default: False) – whether to show a progress bar while playing audio. Requires tqdm; if tqdm is not installed, no progress bar will be shown.
  • player (AudioPlayer, default: None) – audio player to use. If None (default), use player_for() to get a new audio player.
  • progress_bar_kwargs (kwargs) – keyword arguments to pass to tqdm progress_bar builder (e.g., use leave=False to clean up the screen when play finishes).
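
For example (a sketch; assumes region is an AudioRegion and tqdm is installed):

>>> region.play(progress_bar=True, leave=False)  # leave=False is forwarded to tqdm
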
plot(scale_signal=True, show=True, figsize=None, save_as=None, dpi=120, theme='auditok')[source]

Plot audio region, one sub-plot for each channel.

Parameters:
  • scale_signal (bool, default: True) – if True, scale the signal by subtracting its mean and dividing by its standard deviation before plotting.
  • show (bool) – whether to show the plotted signal right after the call.
  • figsize (tuple, default: None) – width and height of the figure to pass to matplotlib.
  • save_as (str, default: None) – if provided, also save the plot to this file.
  • dpi (int, default: 120) – plot dpi to pass to matplotlib.
  • theme (str or dict, default: "auditok") – plot theme to use. Currently only the “auditok” theme is implemented. To provide your own theme, see auditok.plotting.AUDITOK_PLOT_THEME.
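
For example, to render the waveform to an image file without opening a window (a sketch; the output path is arbitrary):

>>> region.plot(show=False, save_as="region.png", figsize=(10, 4), dpi=150)
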
sample_width

Number of bytes per sample (for one channel).

samples

Audio region as arrays of samples, one array per channel.

sampling_rate

Sampling rate of audio data.

save(file, audio_format=None, exists_ok=True, **audio_parameters)[source]

Save audio region to file.

Parameters:
  • file (str) – path to output audio file. May contain a {duration} placeholder as well as any placeholder that this region’s metadata might contain (e.g., regions returned by split contain metadata with start and end attributes that can be used to build the output file name as {meta.start} and {meta.end}). See the examples below for placeholders with formatting.
  • audio_format (str, default: None) – format used to save audio data. If None (default), format is guessed from file name’s extension. If file name has no extension, audio data is saved as a raw (headerless) audio file.
  • exists_ok (bool, default: True) – If True, overwrite file if a file with the same name exists. If False, raise an IOError if file exists.
  • audio_parameters (dict) – any keyword arguments to be passed to audio saving backend.
Returns:

file (str) – name of the output file with placeholders replaced.

Raises:

IOError – raised if file exists and exists_ok is False.

Examples

>>> region = AudioRegion(b'\0' * 2 * 24000,
...                      sampling_rate=16000,
...                      sample_width=2,
...                      channels=1)
>>> region.meta.start = 2.25
>>> region.meta.end = 2.25 + region.duration
>>> region.save('audio_{meta.start}-{meta.end}.wav')
'audio_2.25-3.75.wav'
>>> region.save('region_{meta.start:.3f}_{duration:.3f}.wav')
'region_2.250_1.500.wav'
seconds

A view to slice audio region by seconds (using region.seconds[start:end]).
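
Together with millis, this enables temporal slicing; a small sketch, assuming region is at least two seconds long:

>>> first_second = region.seconds[0:1]  # first second of audio
>>> tail = region.seconds[1:]           # everything after the first second
>>> chunk = region.millis[250:750]      # 500 ms, addressed in milliseconds
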
split(min_dur=0.2, max_dur=5, max_silence=0.3, drop_trailing_silence=False, strict_min_dur=False, **kwargs)[source]

Split audio region. See auditok.split() for a comprehensive description of split parameters. See also AudioRegion.split_and_plot().

split_and_plot(min_dur=0.2, max_dur=5, max_silence=0.3, drop_trailing_silence=False, strict_min_dur=False, scale_signal=True, show=True, figsize=None, save_as=None, dpi=120, theme='auditok', **kwargs)[source]

Split region and plot signal and detections. Alias: splitp(). See auditok.split() for a comprehensive description of split parameters. Also see plot() for plot parameters.

sr

Sampling rate of audio data, alias for sampling_rate.

sw

Number of bytes per sample, alias for sample_width.

class auditok.core.StreamTokenizer(validator, min_length, max_length, max_continuous_silence, init_min=0, init_max_silence=0, mode=0)[source]

Class for stream tokenizers. It implements a 4-state automaton scheme to extract sub-sequences of interest on the fly.

Parameters:
  • validator (callable, DataValidator (must implement is_valid)) – called with each data frame read from source. Should take one positional argument and return True or False for valid and invalid frames respectively.
  • min_length (int) – Minimum number of frames of a valid token. This includes all tolerated non-valid frames within the token.
  • max_length (int) – Maximum number of frames of a valid token. This includes all tolerated non-valid frames within the token.
  • max_continuous_silence (int) – Maximum number of consecutive non-valid frames within a token. Note that, within a valid token, there may be many tolerated silent regions that each contain up to max_continuous_silence non-valid frames.
  • init_min (int) – Minimum number of consecutive valid frames that must be initially gathered before any sequence of non-valid frames can be tolerated. This option is not always needed; it can be used to drop non-valid tokens as early as possible. Default = 0, meaning the option is ineffective by default.
  • init_max_silence (int) – Maximum number of tolerated consecutive non-valid frames if the number of already gathered valid frames has not yet reached init_min. This argument is normally used together with init_min. Default = 0; by default this argument is not taken into consideration.
  • mode (int) –

    mode can be one of the following:

    1. StreamTokenizer.NORMAL: do not drop trailing silence, and accept a token shorter than min_length if it is the continuation of the latest delivered token.

    2. StreamTokenizer.STRICT_MIN_LENGTH: if token i is delivered because max_length is reached, and token i+1 is immediately adjacent to token i (i.e., token i ends at frame k and token i+1 starts at frame k+1), then accept token i+1 only if it has a size of at least min_length. The default behavior is to accept token i+1 even if it is shorter than min_length (provided that the above conditions are fulfilled, of course).

    3. StreamTokenizer.DROP_TRAILING_SILENCE: drop all trailing non-valid frames from a token to be delivered if and only if it is not truncated. This can be a bit tricky. A token is actually delivered if:

    • max_continuous_silence is reached, or
    • its length reaches max_length. This is referred to as a truncated token.

    In the current implementation, a StreamTokenizer’s decision is based only on already seen data and on incoming data. Thus, if a token is truncated at a non-valid but tolerated frame (max_length is reached but max_continuous_silence is not yet), any trailing silence will be kept because it could potentially be part of a valid token (had max_length been bigger). But if max_continuous_silence is reached before max_length, the delivered token is not considered truncated but rather the result of a normal end of detection (i.e., no more valid data). In that case the trailing silence can be removed using the StreamTokenizer.DROP_TRAILING_SILENCE mode.

    4. StreamTokenizer.STRICT_MIN_LENGTH | StreamTokenizer.DROP_TRAILING_SILENCE: use both options. That means: first remove trailing silence, then check whether the token still has a length of at least min_length.

Examples

In the following code, without STRICT_MIN_LENGTH, the ‘BB’ token is accepted although it is shorter than min_length (3), because it immediately follows the latest delivered token:

>>> from auditok.core import StreamTokenizer
>>> from auditok.util import StringDataSource, DataValidator
>>> class UpperCaseChecker(DataValidator):
...     def is_valid(self, frame):
...         return frame.isupper()
>>> dsource = StringDataSource("aaaAAAABBbbb")
>>> tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
...                             min_length=3,
...                             max_length=4,
...                             max_continuous_silence=0)
>>> tokenizer.tokenize(dsource)
[(['A', 'A', 'A', 'A'], 3, 6), (['B', 'B'], 7, 8)]

The following tokenizer will however reject the ‘BB’ token:

>>> dsource = StringDataSource("aaaAAAABBbbb")
>>> tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
...                             min_length=3, max_length=4,
...                             max_continuous_silence=0,
...                             mode=StreamTokenizer.STRICT_MIN_LENGTH)
>>> tokenizer.tokenize(dsource)
[(['A', 'A', 'A', 'A'], 3, 6)]
>>> tokenizer = StreamTokenizer(
...     validator=UpperCaseChecker(),
...     min_length=3,
...     max_length=6,
...     max_continuous_silence=3,
...     mode=StreamTokenizer.DROP_TRAILING_SILENCE
... )
>>> dsource = StringDataSource("aaaAAAaaaBBbbbb")
>>> tokenizer.tokenize(dsource)
[(['A', 'A', 'A', 'a', 'a', 'a'], 3, 8), (['B', 'B'], 9, 10)]

The first token is delivered with its trailing silence because it is truncated, while the second one has its trailing frames removed.

Without StreamTokenizer.DROP_TRAILING_SILENCE the output would be:

[
    (['A', 'A', 'A', 'a', 'a', 'a'], 3, 8),
    (['B', 'B', 'b', 'b', 'b'], 9, 13)
]
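
Combining both modes on the same data, the short adjacent token should also be dropped once its trailing frames are removed (a sketch; the output below is inferred from the rules described above, not a recorded run):

>>> tokenizer = StreamTokenizer(
...     validator=UpperCaseChecker(),
...     min_length=3,
...     max_length=6,
...     max_continuous_silence=3,
...     mode=StreamTokenizer.STRICT_MIN_LENGTH | StreamTokenizer.DROP_TRAILING_SILENCE
... )
>>> dsource = StringDataSource("aaaAAAaaaBBbbbb")
>>> tokenizer.tokenize(dsource)
[(['A', 'A', 'A', 'a', 'a', 'a'], 3, 8)]
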
tokenize(data_source, callback=None, generator=False)[source]

Read data from data_source, one frame at a time, and process the read frames in order to detect sequences of frames that make up valid tokens.

Parameters:
  • data_source (instance of DataSource) – an object that implements a read method. read should return a slice of the signal, i.e., a frame (of whatever type, as long as it can be processed by validator), and None when there is no more signal.
  • callback (callable, optional) – a 3-argument function. If given, it will be called each time a valid token is found.

Returns:

A list of tokens if callback is None. Each token is a tuple (data, start, end), where data is a list of read frames, start is the index of the first frame in the original data, and end is the index of the last frame.
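
A sketch of the callback form, reusing UpperCaseChecker and StringDataSource from the examples above (per the Returns note, the token list is only built when callback is None):

>>> def on_token(data, start, end):
...     print("token:", "".join(data), "frames", start, "to", end)
>>> dsource = StringDataSource("aaaAAAABBbbb")
>>> tokenizer = StreamTokenizer(validator=UpperCaseChecker(),
...                             min_length=3, max_length=4,
...                             max_continuous_silence=0)
>>> tokenizer.tokenize(dsource, callback=on_token)
token: AAAA frames 3 to 6
token: BB frames 7 to 8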