Feature System

VoxAtlas features are implemented as small, composable extractors that share a common contract. The registry and pipeline use this contract to discover features, validate metadata (names/units/dependencies), and execute extractors consistently.

The feature system centers on:

extractor classes that implement one feature
structured feature inputs (audio, units, shared context)
typed feature outputs (scalar/vector/matrix/table/array containers)
registry metadata used for discovery, validation, and dependency planning

Extractor contract (what you implement)

All extractors inherit from BaseExtractor and typically define:

name (required): fully-qualified feature name like "acoustic.pitch.f0"
input_units / output_units (optional): unit labels such as "token" or "frame" (or None for audio/global features)
dependencies (optional): upstream feature names that must run first
default_config (optional): per-feature default parameters
compute(feature_input, params) (required): returns a structured feature output

Extractors should be stateless. If you need upstream results, read them from feature_input.context["feature_store"] rather than storing them on the extractor instance.
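As an illustration of the contract above, here is a minimal sketch. It does not import VoxAtlas: BaseExtractor and FeatureInput are simplified stand-ins defined locally, units are modeled as a plain list of strings, and the feature name "example.token.count", the min_length parameter, and the plain-dict return value are all hypothetical. The real base class and output types may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FeatureInput:
    """Stand-in for the real FeatureInput bundle (assumed shape)."""
    audio: Optional[Any] = None
    units: Optional[Any] = None
    context: dict = field(default_factory=dict)


class BaseExtractor:
    """Stand-in base class listing the attributes named in the contract."""
    name: str = ""
    input_units: Optional[str] = None
    output_units: Optional[str] = None
    dependencies: tuple = ()
    default_config: Optional[dict] = None

    def compute(self, feature_input, params):
        raise NotImplementedError


class TokenCountExtractor(BaseExtractor):
    """Hypothetical feature: counts tokens with at least `min_length` characters."""
    name = "example.token.count"
    input_units = "token"
    output_units = None  # a single global value
    default_config = {"min_length": 1}

    def compute(self, feature_input, params):
        # Stateless: everything comes from the arguments, nothing is kept on self.
        tokens = feature_input.units or []
        count = sum(1 for tok in tokens if len(tok) >= params["min_length"])
        return {"feature": self.name, "value": count}


result = TokenCountExtractor().compute(
    FeatureInput(units=["the", "a", "voice"]), {"min_length": 2}
)
print(result)  # {'feature': 'example.token.count', 'value': 2}
```

Because the extractor keeps no state between calls, the same instance can be reused across streams without one run leaking into the next.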

Feature inputs (what you receive)

Each extractor invocation receives a FeatureInput bundle:

feature_input.audio: Audio or None
feature_input.units: Units or None
feature_input.context: shared runtime dictionary (pipeline config + feature store)

The pipeline stores the runtime config and the feature store in the context:

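store = feature_input.context["feature_store"]   # results of features that already ran
upstream = store.get("syntax.dependencies")      # output of a declared dependency
config = feature_input.context["config"]         # pipeline-level runtime config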
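The lookup can be wrapped so that a missing upstream result fails with a clear message. This is only a sketch: the feature store is modeled as a plain dict whose get returns None for a missing key, which is an assumption about the real store, and the helper name require_upstream is hypothetical.

```python
def require_upstream(context, feature_name):
    """Return an upstream result from the feature store or raise a clear error."""
    store = context["feature_store"]
    upstream = store.get(feature_name)
    if upstream is None:
        raise KeyError(
            f"Upstream feature {feature_name!r} is not in the feature store; "
            "declare it in `dependencies` so it runs first."
        )
    return upstream


# Context shaped as described above: pipeline config plus feature store.
context = {
    "config": {"feature_config": {}},
    "feature_store": {"syntax.dependencies": [("voice", "nsubj")]},
}

deps = require_upstream(context, "syntax.dependencies")
print(deps)  # [('voice', 'nsubj')]
```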

Feature outputs (what you return)

VoxAtlas standardizes common output shapes in voxatlas.features.feature_output:

ScalarFeatureOutput: one scalar per unit
VectorFeatureOutput: time-aligned 1D sequence
MatrixFeatureOutput: time-frequency matrix
TableFeatureOutput: tabular output (DataFrame)
ArrayFeatureOutput: raw NumPy array

Most extractors should return one of these dataclasses so downstream consumers and writers can handle outputs uniformly.
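The fields of these dataclasses are defined in voxatlas.features.feature_output and are not reproduced here. The sketch below uses hypothetical stand-ins with assumed fields (feature, unit, values, time) only to show why shared containers let a downstream writer treat outputs uniformly.

```python
from dataclasses import dataclass


@dataclass
class ScalarOutputSketch:
    """Hypothetical stand-in: one scalar per unit (fields are assumptions)."""
    feature: str
    unit: str
    values: dict  # unit id -> scalar


@dataclass
class VectorOutputSketch:
    """Hypothetical stand-in: a time-aligned 1D sequence."""
    feature: str
    unit: str
    time: list
    values: list


def to_rows(output):
    """Flatten either container into (feature, key, value) rows for a writer."""
    if isinstance(output, ScalarOutputSketch):
        return [(output.feature, uid, v) for uid, v in output.values.items()]
    if isinstance(output, VectorOutputSketch):
        return [(output.feature, t, v) for t, v in zip(output.time, output.values)]
    raise TypeError(f"Unsupported output type: {type(output).__name__}")


rows = to_rows(ScalarOutputSketch("example.token.count", "word", {0: 3, 1: 5}))
print(rows)  # [('example.token.count', 0, 3), ('example.token.count', 1, 5)]
```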

Registry + discovery (how features become runnable)

VoxAtlas uses a global FeatureRegistry instance to map feature names to extractor classes and metadata.

Registration: feature modules typically call registry.register(MyExtractor) at import time.
Discovery: voxatlas.core.discovery.discover_features() walks the voxatlas.features package and imports modules to trigger registrations.
Optional dependencies: if importing a feature module fails due to a missing third-party dependency, discovery records an unavailable registry entry (name, units, dependencies, missing dependency) so the CLI can still report it.
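The register-on-import pattern and the handling of failed imports can be sketched as follows. RegistrySketch, record_unavailable, and discover are hypothetical stand-ins, not the FeatureRegistry or discover_features() API, and the module name hypothetical_missing_backend_xyz is deliberately nonexistent so that the import fails. The real discovery also records units and dependencies for unavailable features; how it obtains them is not shown here.

```python
import importlib


class RegistrySketch:
    """Hypothetical stand-in for a feature registry."""

    def __init__(self):
        self.available = {}    # feature name -> extractor class
        self.unavailable = {}  # module name -> name of the missing import

    def register(self, extractor_cls):
        if extractor_cls.name in self.available:
            raise ValueError(f"Duplicate feature name: {extractor_cls.name!r}")
        self.available[extractor_cls.name] = extractor_cls
        return extractor_cls

    def record_unavailable(self, module_name, missing):
        self.unavailable[module_name] = missing


def discover(registry, module_names):
    """Import each module; a successful import would trigger its register() calls."""
    for module_name in module_names:
        try:
            importlib.import_module(module_name)
        except ImportError as exc:
            # exc.name is the module that could not be imported, when known.
            registry.record_unavailable(module_name, exc.name or str(exc))


registry = RegistrySketch()


class PitchExtractor:
    name = "acoustic.pitch.f0"


registry.register(PitchExtractor)  # normally done at import time by the module
discover(registry, ["json", "hypothetical_missing_backend_xyz"])
print(sorted(registry.available))  # ['acoustic.pitch.f0']
print(registry.unavailable)
```

Catching the ImportError per module keeps one missing optional backend from blocking discovery of every other feature.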

Configuration and parameters

The pipeline resolves per-feature parameters by merging:

an extractor’s default_config (if provided), with
user overrides under config["feature_config"][<feature_name>]

See voxatlas.config.feature_config.resolve_feature_config() for the exact merge behavior.
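As a rough illustration of the merge, the sketch below assumes a shallow merge in which user overrides win key by key. The real resolve_feature_config() may behave differently (for example with nested dictionaries); the function name resolve_params_sketch and the fmin/fmax parameters are hypothetical.

```python
def resolve_params_sketch(default_config, config, feature_name):
    """Shallow merge: user overrides win over extractor defaults (assumed behavior)."""
    overrides = config.get("feature_config", {}).get(feature_name, {})
    return {**(default_config or {}), **overrides}


defaults = {"fmin": 75.0, "fmax": 500.0}
config = {"feature_config": {"acoustic.pitch.f0": {"fmax": 400.0}}}

params = resolve_params_sketch(defaults, config, "acoustic.pitch.f0")
print(params)  # {'fmin': 75.0, 'fmax': 400.0}
```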

Units and alignment

Unit labels declared on extractors are validated when the extractor is registered (for example "token", "word", "frame", or "conversation"). For how unit tables are represented at runtime, see Unit Hierarchy.

At execution time, extractors should still check that required unit tables and columns exist for the current dataset/stream and raise clear errors when they do not.
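A runtime check of this kind might look like the sketch below. Unit tables are modeled as a dict mapping a unit label to a dict of columns, which is an assumption made to keep the example self-contained; the real Units object is described in Unit Hierarchy. The helper name require_unit_columns is hypothetical.

```python
def require_unit_columns(units, unit_label, required_columns):
    """Return the unit table, or raise if the table or a column is missing."""
    if units is None or unit_label not in units:
        raise ValueError(f"Unit table {unit_label!r} is not available for this stream.")
    table = units[unit_label]
    missing = [col for col in required_columns if col not in table]
    if missing:
        raise ValueError(
            f"Unit table {unit_label!r} is missing required columns: {missing}"
        )
    return table


units = {"token": {"start": [0.0, 0.4], "end": [0.4, 0.9]}}

try:
    require_unit_columns(units, "token", ["start", "end", "text"])
except ValueError as err:
    message = str(err)
    print(message)  # Unit table 'token' is missing required columns: ['text']
```

Naming the unit label and the missing columns in the error lets a user tell a dataset problem apart from a bug in the extractor.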

Useful API pages

voxatlas.features.base_extractor
voxatlas.features.feature_input
voxatlas.features.feature_output
FeatureRegistry
voxatlas.core.discovery.discover_features()

Where to go next

Writing Extractors for a step-by-step extractor tutorial
Pipeline for how dependencies, parallelism, and caching are handled during execution