auditok.core.split(input, min_dur=0.2, max_dur=5, max_silence=0.3, drop_trailing_silence=False, strict_min_dur=False, **kwargs)

Split audio data and return a generator of AudioRegions

Parameters:
  • input (str, bytes, AudioSource, AudioReader, AudioRegion or None) – input audio data. If str, it should be a path to an existing audio file; “-” is interpreted as standard input. If bytes, input is treated as raw audio data. If None, audio is read from the microphone. Any object that is not an AudioReader is converted to an AudioReader before processing. If input is a str that refers to a raw audio file, a bytes object or None, audio parameters must be provided via kwargs (i.e., sampling_rate, sample_width and channels, or their aliases). If input is a str, the audio format is guessed from the file extension; the audio_format (alias fmt) kwarg can also be given to specify the format explicitly. If neither is available, the backend (currently only pydub is supported) is used to load the data.
  • min_dur (float, default: 0.2) – minimum duration in seconds of a detected audio event. With large values of min_dur, very short audio events (e.g., one-word utterances like ‘yes’ or ‘no’) can be missed. Very small values might result in a high number of short, useless audio events.
  • max_dur (float, default: 5) – maximum duration in seconds of a detected audio event. If an audio event lasts more than max_dur it will be truncated. If the continuation of a truncated audio event is shorter than min_dur then this continuation is accepted as a valid audio event if strict_min_dur is False. Otherwise it is rejected.
  • max_silence (float, default: 0.3) – maximum duration of continuous silence within an audio event. There might be many silent gaps of this duration within one audio event. If the continuous silence happens at the end of the event then it is kept as part of the event if drop_trailing_silence is False (default).
  • drop_trailing_silence (bool, default: False) – Whether to remove trailing silence from detected events. To avoid abrupt cuts in speech, trailing silence should be kept, therefore this parameter should be False.
  • strict_min_dur (bool, default: False) – strict minimum duration. Do not accept an audio event if it is shorter than min_dur, even if it is contiguous to the latest valid event. This happens if the latest detected event reached max_dur.
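The interaction between min_dur, max_dur, max_silence, drop_trailing_silence and strict_min_dur can be easier to follow on a toy model. The sketch below is a simplified illustration and not auditok's actual tokenizer: it assumes each analysis window has already been classified as valid (True) or silent (False), and it expresses all durations as numbers of windows rather than seconds. The function name split_windows and its parameter names are hypothetical.

```python
def split_windows(valid, min_len, max_len, max_silence,
                  drop_trailing_silence=False, strict_min_len=False):
    """Group per-window validity flags into (start, end) events (end exclusive).

    Simplified model for illustration only; all lengths are window counts.
    """
    events = []
    start = None        # index of the first window of the current event
    silence = 0         # number of consecutive silent windows at the event's tail
    contiguous = False  # True if the previous event was cut because it hit max_len

    def close(end, trailing, truncated):
        nonlocal start, silence, contiguous
        if drop_trailing_silence:
            end -= trailing
        length = end - start
        # A short event is still accepted when it continues a truncated one,
        # unless strict_min_len forbids it.
        if length >= min_len or (length > 0 and contiguous and not strict_min_len):
            events.append((start, end))
        contiguous = truncated
        start, silence = None, 0

    for i, is_valid in enumerate(valid):
        if start is None:
            if not is_valid:
                contiguous = False  # silence between events breaks contiguity
                continue
            start, silence = i, 0
        elif is_valid:
            silence = 0
        else:
            silence += 1

        if silence > max_silence:
            # Too much continuous silence: end the event before this window.
            close(i, silence - 1, truncated=False)
        elif i + 1 - start == max_len:
            # The event reached its maximum length: truncate it here.
            close(i + 1, silence, truncated=True)

    if start is not None:
        close(len(valid), silence, truncated=False)
    return events


flags = [bool(v) for v in [0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0]]
print(split_windows(flags, min_len=2, max_len=5, max_silence=1))
```

In this model, six consecutive valid windows with max_len=5 produce one truncated event followed by a one-window continuation; the continuation is kept when strict_min_len is False and dropped when it is True, mirroring the behavior described for strict_min_dur.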
Other Parameters:
  • analysis_window, aw (float, default: 0.05 (50 ms)) – duration of analysis window in seconds. A value between 0.01 (10 ms) and 0.1 (100 ms) should be good for most use-cases.
  • audio_format, fmt (str) – type of audio data (e.g., wav, ogg, flac, raw, etc.). This will only be used if input is a string path to an audio file. If not given, audio type will be guessed from file name extension or from file header.
  • sampling_rate, sr (int) – sampling rate of audio data. Required if input is a raw audio file, a bytes object or None (i.e., read from microphone).
  • sample_width, sw (int) – number of bytes used to encode one audio sample, typically 1, 2 or 4. Required for raw data, see sampling_rate.
  • channels, ch (int) – number of channels of audio data. Required for raw data, see sampling_rate.
  • use_channel, uc ({None, “mix”} or int) – which channel to use for split if input has multiple audio channels. Regardless of which channel is used for splitting, returned audio events contain data from all channels, just as input. The following values are accepted:
    • None (alias “any”): accept audio activity from any channel, even if other channels are silent. This is the default behavior.
    • “mix” (“avg” or “average”): mix down all channels (i.e. compute average channel) and split the resulting channel.
    • int (0 <= use_channel < channels): use one channel, specified by its integer id, for split.
  • large_file (bool, default: False) – If True, AND if input is a path to a wav or a raw audio file (and only these two formats), then audio data is lazily loaded to memory (i.e., one analysis window at a time). Otherwise the whole file is loaded to memory before split. Set to True if the size of the file is larger than available memory.
  • max_read, mr (float, default: None, read until end of stream) – maximum data to read from source in seconds.
  • validator, val (callable, DataValidator) – custom data validator. If None (default), an AudioEnergyValidator is used with the given energy threshold. Can be a callable or an instance of DataValidator that implements is_valid. In either case, it will be called with a window of audio data as the first parameter.
  • energy_threshold, eth (float, default: 50) – energy threshold for audio activity detection. Audio regions that have enough windows with a signal energy equal to or above this threshold are considered valid audio events. We refer to this quantity as the energy of the signal but, to be more accurate, it is the log energy, computed as: 20 * log10(sqrt(dot(x, x) / len(x))) (see AudioEnergyValidator and calculate_energy_single_channel()). If validator is given, this argument is ignored.
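To make the energy formula and the use_channel options concrete, here is a minimal pure-Python sketch. It is not auditok's implementation: the helper names log_energy and window_is_valid are hypothetical, each channel is assumed to be a plain list of integer samples for one analysis window, and the small floor used to avoid log10(0) is an assumption made for this example.

```python
import math


def log_energy(samples):
    """Return 20 * log10(sqrt(dot(x, x) / len(x))) for one channel."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Assumed guard against log10(0) for an all-zero window.
    return 20 * math.log10(max(rms, 1e-10))


def window_is_valid(channels, energy_threshold=50, use_channel=None):
    """Decide whether one analysis window counts as audio activity.

    channels: list of per-channel sample lists, all of the same length.
    """
    if use_channel in (None, "any"):
        # Activity on any single channel is enough.
        return any(log_energy(ch) >= energy_threshold for ch in channels)
    if use_channel in ("mix", "avg", "average"):
        # Average the channels sample by sample, then test the result.
        mixed = [sum(frame) / len(channels) for frame in zip(*channels)]
        return log_energy(mixed) >= energy_threshold
    # Otherwise use_channel is an integer channel id.
    return log_energy(channels[use_channel]) >= energy_threshold


loud = [1000, -1000] * 4      # RMS of 1000, i.e. a log energy of about 60
quiet = [10, -10] * 4         # RMS of 10, i.e. a log energy of about 20
inverted = [-s for s in loud]

print(window_is_valid([loud, quiet]))                     # any channel
print(window_is_valid([loud, quiet], use_channel=1))      # quiet channel only
print(window_is_valid([loud, inverted], use_channel="mix"))
```

The last call illustrates one consequence of mixing: two loud channels in opposite phase average to silence, so "mix" can reject a window that None (any channel) would accept.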

Yields:
  • AudioRegion – a generator of detected AudioRegion objects.
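As a usage sketch, the example below builds a short synthetic signal as raw bytes (silence, a tone, silence) and passes it to split together with the audio parameters that raw data requires. It assumes auditok is installed; the helper names (tone, silence, make_test_signal, detect_events) and the chosen signal are made up for this illustration, and the number and length of the detected regions depend on the parameters, so no output is shown.

```python
import math
import struct

SAMPLING_RATE = 16000  # samples per second
SAMPLE_WIDTH = 2       # bytes per sample (16-bit signed)
CHANNELS = 1


def tone(duration, freq=440.0, amplitude=10000):
    """Return `duration` seconds of a sine tone as little-endian 16-bit PCM."""
    n = int(duration * SAMPLING_RATE)
    samples = [
        int(amplitude * math.sin(2 * math.pi * freq * i / SAMPLING_RATE))
        for i in range(n)
    ]
    return struct.pack("<%dh" % n, *samples)


def silence(duration):
    """Return `duration` seconds of digital silence."""
    return b"\x00" * (int(duration * SAMPLING_RATE) * SAMPLE_WIDTH * CHANNELS)


def make_test_signal():
    """0.5 s of silence, 1 s of tone, 0.5 s of silence."""
    return silence(0.5) + tone(1.0) + silence(0.5)


def detect_events(data):
    """Split raw audio bytes and return the duration of each detected region."""
    import auditok  # third-party; imported here so the helpers above stay stdlib-only

    regions = auditok.split(
        data,
        min_dur=0.2,
        max_dur=5,
        max_silence=0.3,
        energy_threshold=50,
        sr=SAMPLING_RATE,  # required because the input is raw bytes
        sw=SAMPLE_WIDTH,
        ch=CHANNELS,
    )
    return [region.duration for region in regions]


if __name__ == "__main__":
    for duration in detect_events(make_test_signal()):
        print("detected event lasting %.2f s" % duration)
```

Because split returns a generator, regions are produced as the input is consumed; collecting them in a list, as detect_events does, is only appropriate for finite inputs such as files or byte strings, not for an open-ended microphone stream.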