Units

Defined in: voxatlas.units.units

class voxatlas.units.units.Units(frames=None, tokens=None, phonemes=None, syllables=None, sentences=None, words=None, ipus=None, turns=None, speaker=None)

Bases: object

Container for hierarchical speech units (tables) for a single stream.

VoxAtlas feature extractors operate on unit tables (frames, tokens, phonemes, syllables, words, etc.) that are time-aligned and optionally linked through parent-child identifiers. Units is a lightweight, backend-agnostic wrapper around those tables: it stores them, normalizes unit type names (singular/plural aliases), and provides a small set of convenience accessors (lookup, durations, parent/children grouping).

This class intentionally does not enforce a rigid schema beyond what its helper methods require; extractors may expect additional columns such as label or token depending on the feature.

Parameters:

frames (pandas.DataFrame | None) – Frame-level table.
tokens (pandas.DataFrame | None) – Token-level table.
phonemes (pandas.DataFrame | None) – Phoneme-level table.
syllables (pandas.DataFrame | None) – Syllable-level table.
sentences (pandas.DataFrame | None) – Sentence-level table.
words (pandas.DataFrame | None) – Word-level table.
ipus (pandas.DataFrame | None) – Inter-pausal-unit table.
turns (pandas.DataFrame | None) – Turn-level table.
speaker (str | None) – Optional speaker label for the stream.

Returns:

Hierarchical unit container for one stream.

Return type:

Units

frames, tokens, phonemes, syllables, sentences, words, ipus, turns

Stored unit tables. Any table may be None if it is unavailable.

Type: pandas.DataFrame | None

speaker

Speaker label associated with this stream (if known).

Type: str | None

Notes

Unit labels

Methods that take a unit_type argument (for example, table()) accept both singular and plural labels:

"frame" / "frames"
"token" / "tokens"
"phoneme" / "phonemes"
"syllable" / "syllables"
"sentence" / "sentences"
"word" / "words"
"ipu" / "ipus"
"turn" / "turns"
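
The alias handling above can be pictured as a small lookup table. The following is a minimal sketch only, not the VoxAtlas implementation: the helper name normalize_unit_type, the choice of the plural form as the canonical name, and case-sensitive matching are all assumptions made for illustration.

```python
# Hypothetical sketch of singular/plural alias normalization.
# Not taken from VoxAtlas; names and behaviour are illustrative assumptions.
_CANONICAL = (
    "frames", "tokens", "phonemes", "syllables",
    "sentences", "words", "ipus", "turns",
)

# Map each plural label and its singular form to the plural label.
_ALIASES = {}
for plural in _CANONICAL:
    _ALIASES[plural] = plural
    _ALIASES[plural[:-1]] = plural  # every plural listed here is singular + "s"


def normalize_unit_type(unit_type: str) -> str:
    """Return the canonical (plural) label for a singular or plural unit label."""
    try:
        return _ALIASES[unit_type]
    except KeyError:
        raise ValueError(f"Unknown unit type: {unit_type!r}") from None
```

With such a mapping, "word" and "words" resolve to the same table name, and an unrecognized label fails early with ValueError, which matches the error type documented for table().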

Table conventions

Units works best when each DataFrame follows a few simple conventions:

id: unique identifier for the unit row (typically integer-like).
start and end: segment boundaries on a shared timeline (commonly seconds). Used by duration() and by many extractors.
Parent-child links (optional): to connect units explicitly, include a <parent>_id column on the child table. For example, syllables that belong to words can carry a word_id column; phonemes that belong to syllables can carry a syllable_id column. parent() and children() use this naming convention.
table() returns the underlying DataFrame object. If you mutate it, you are mutating the table stored on the Units instance.
If a requested table is missing (None), table() raises ValueError; callers can either catch this or check the relevant attribute first.
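
To make these conventions concrete, the sketch below builds two conforming tables and performs the corresponding operations in plain pandas. It assumes that duration() amounts to end minus start and that parent() and children() read the <parent>_id column; the descriptions on this page suggest this, but the sketch does not call VoxAtlas and should not be read as its implementation.

```python
import pandas as pd

# Two tables on a shared timeline (seconds), each with id, start, and end.
words = pd.DataFrame(
    {"id": [1, 2], "start": [0.0, 1.0], "end": [1.0, 1.5], "label": ["hello", "there"]}
)
syllables = pd.DataFrame(
    {
        "id": [10, 11, 12],
        "word_id": [1, 1, 2],  # <parent>_id link to the words table
        "start": [0.0, 0.5, 1.0],
        "end": [0.5, 1.0, 1.5],
        "label": ["he", "llo", "there"],
    }
)

# Duration under the start/end convention: one value per row.
syllable_durations = syllables["end"] - syllables["start"]

# Parent lookup under the naming convention: the word_id column itself.
parent_ids = syllables["word_id"]

# Children grouping under the naming convention: group the child rows by word_id.
syllables_per_word = syllables.groupby("word_id")["id"].apply(list).to_dict()
# syllables_per_word maps word 1 to [10, 11] and word 2 to [12]
```

Keeping the link in a plain <parent>_id column means the hierarchy can be inspected and repaired with ordinary DataFrame operations, without any VoxAtlas-specific tooling.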

Examples

>>> import pandas as pd
>>> from voxatlas.units import Units
>>> words = pd.DataFrame({"id": [1], "start": [0.0], "end": [1.0], "label": ["hello"]})
>>> syllables = pd.DataFrame(
...     {"id": [10], "word_id": [1], "start": [0.0], "end": [0.5], "label": ["he"]}
... )
>>> units = Units(words=words, syllables=syllables, speaker="A")
>>> units.table("word").shape
(1, 4)
>>> float(units.duration("word").iloc[0])
1.0

table(unit_type)

Return the table for a requested unit type.

Parameters: unit_type (str) – Unit label such as "token" or "syllable".
Returns: Table associated with the requested unit type.
Return type: pandas.DataFrame
Raises: ValueError – Raised when the unit type is invalid or unavailable.

Examples

>>> import pandas as pd
>>> from voxatlas.units import Units
>>> tokens = pd.DataFrame({"id": [1], "start": [0.0], "end": [0.2], "label": ["hi"]})
>>> units = Units(tokens=tokens)
>>> units.table("token").columns.tolist()
['id', 'start', 'end', 'label']
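
Two documented behaviours of table() are easy to overlook: it returns the stored DataFrame itself rather than a copy, and it raises ValueError when the table is None. The sketch below illustrates both with a hypothetical stand-in class, UnitsStandIn, written only for this example. It is not the VoxAtlas class, and its alias handling is deliberately simplistic.

```python
import pandas as pd


class UnitsStandIn:
    """Hypothetical stand-in for Units; not the VoxAtlas implementation."""

    def __init__(self, **tables):
        self._tables = tables  # e.g. {"tokens": DataFrame, "words": None}

    def table(self, unit_type):
        # Simplistic alias handling for the sketch: add a trailing "s" if missing.
        name = unit_type if unit_type.endswith("s") else unit_type + "s"
        table = self._tables.get(name)
        if table is None:
            raise ValueError(f"No table available for unit type {unit_type!r}")
        return table  # the stored object itself, not a copy


tokens = pd.DataFrame({"id": [1], "start": [0.0], "end": [0.2], "label": ["hi"]})
units = UnitsStandIn(tokens=tokens, words=None)

# Mutating the returned table changes the stored one: they are the same object.
returned = units.table("token")
returned["duration"] = returned["end"] - returned["start"]
same_object = units.table("tokens") is tokens
has_new_column = "duration" in tokens.columns

# Copy first when the stored table should stay untouched.
scratch = units.table("token").copy()
scratch["scratch"] = 0
stored_unchanged = "scratch" not in tokens.columns

# A missing table surfaces as ValueError, which callers can catch.
try:
    units.table("word")
    missing_raised = False
except ValueError:
    missing_raised = True
```

In extractor code, calling .copy() on the returned table before adding helper columns is the conservative choice when the same Units instance is shared across several extractors.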

get(unit_type)

Alias for table().

Parameters: unit_type (str) – Requested unit label.
Returns: Requested unit table.
Return type: pandas.DataFrame

Examples

>>> import pandas as pd
>>> from voxatlas.units import Units
>>> words = pd.DataFrame({"id": [1], "start": [0.0], "end": [1.0], "label": ["hello"]})
>>> units = Units(words=words)
>>> units.get("word").shape[0]
1

duration(unit_type)

Compute durations from start and end columns.

Parameters: unit_type (str) – Requested unit label.
Returns: Duration values for each row.
Return type: pandas.Series

Examples

>>> import pandas as pd
>>> from voxatlas.units import Units
>>> words = pd.DataFrame({"id": [1], "start": [0.25], "end": [1.00], "label": ["hello"]})
>>> units = Units(words=words)
>>> float(units.duration("word").iloc[0])
0.75

parent(child_type, parent_type)

Return parent identifiers for a child unit table.

Parameters:

child_type (str) – Child unit label.
parent_type (str) – Parent unit label.

Returns:

Parent identifier column.

Return type:

pandas.Series

Raises:

ValueError – Raised when the mapping column (the <parent>_id column on the child table) is unavailable.

Examples

>>> import pandas as pd
>>> from voxatlas.units import Units
>>> words = pd.DataFrame({"id": [1], "start": [0.0], "end": [1.0], "label": ["hello"]})
>>> syllables = pd.DataFrame(
...     {"id": [10, 11], "word_id": [1, 1], "start": [0.0, 0.5], "end": [0.5, 1.0], "label": ["he", "llo"]}
... )
>>> units = Units(words=words, syllables=syllables)
>>> units.parent("syllable", "word").tolist()
[1, 1]

children(parent_type, child_type)

Group child units by parent identifier.

Parameters:

parent_type (str) – Parent unit label.
child_type (str) – Child unit label.

Returns:

Grouped child table keyed by parent identifier.

Return type:

DataFrameGroupBy

Raises:

ValueError – Raised when the mapping column (the <parent>_id column on the child table) is unavailable.

Examples

>>> import pandas as pd
>>> from voxatlas.units import Units
>>> words = pd.DataFrame({"id": [1], "start": [0.0], "end": [1.0], "label": ["hi"]})
>>> phonemes = pd.DataFrame(
...     {"id": [100, 101], "word_id": [1, 1], "start": [0.0, 0.5], "end": [0.5, 1.0], "label": ["h", "i"]}
... )
>>> units = Units(words=words, phonemes=phonemes)
>>> units.children("word", "phoneme").ngroups
1

group(child_type, by)

Alias for children() with the argument order reversed: the child unit comes first, and by names the parent unit.

Parameters:

child_type (str) – Child unit label.
by (str) – Parent unit label.

Returns:

Grouped child table.

Return type:

DataFrameGroupBy

Examples

>>> import pandas as pd
>>> from voxatlas.units import Units
>>> words = pd.DataFrame({"id": [1], "start": [0.0], "end": [1.0], "label": ["hi"]})
>>> phonemes = pd.DataFrame(
...     {"id": [100, 101], "word_id": [1, 1], "start": [0.0, 0.5], "end": [0.5, 1.0], "label": ["h", "i"]}
... )
>>> units = Units(words=words, phonemes=phonemes)
>>> units.group("phoneme", by="word").ngroups
1