Skip to content

Table Configuration

The previous section describes how to configure the Texture interface. This section describes how to configure the data that is passed to the interface.

Texture splits data into multiple tables to enable attributes with many to one relationships with each document. Texture categorizes metadata into the types in the following table.

Data TypeDescriptionExample
TextThe text documentsPaper abstracts, social media posts, news articles
Metadata: single-valueMetadata attribute that only has one value per document. Number, categorical, or temporalDocument rating score, topic, publication date
Metadata: hierarchical segmentMetadata with multiple values per document where each value maps directly to a segment of the textWords, tokens, phrases, POS tags, NER tags
Metadata: hierarchical non-segmentMetadata with multiple values per document that do not map directly to the textList of authors, list of keywords
Vector EmbeddingsAny high dimensional document embedding + a 2D projectionGPT3 document embeddings with 2D UMAP projection

Formatting your data into multiple tables

Hierarchical attributes are separated into multiple tables (multiple Pandas DataFrames in Python) to avoid arrays in the dataframe. This is equivalent to putting the data into first normal form in database terminology.

For example, the following figure shows how we can split data with arrays for the word, POS (part of speech), and author attributes into new tables:

Multiple Tables

When creating our DatasetSchema, we then specify which table each attribute comes from:

python
import pandas as pd
import texture
from texture.models import DatasetSchema, Column, DerivedSchema

load_tables = {
    "main": df_main,
    "document_words": df_words,
    "document_authors": df_authors,
}

schema = DatasetSchema(
    name="main",
    columns=[
        # columns from table "main"
        Column(name="document", type="text"),
        Column(name="topic", type="categorical"),

        # columns from table "document_words"
        Column(
            name="word",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=True,
                table_name="document_words",
                derived_from="document",
            ),
        ),
        Column(
            name="POS",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=True,
                table_name="words_table",
                derived_from="Abstract",
            ),
        ),

        # columns from table "document_authors"
        Column(
            name="author",
            type="categorical",
            derivedSchema=DerivedSchema(
                is_segment=False,
                table_name="authors_table",
            ),
        ),
    ],
    primary_key=Column(name="id", type="number"),
)

Required Columns

Main table

There are several reserved column names in the main table that are used in the interface.

  • id: A unique identifier for each row.
  • vector: (Optional) A column containing embeddings for the text data.
  • umap_x and umap_y: (Optional) Columns containing 2d projections of the embeddings.

Hierarchical attribute tables

Likewise, each attribute table expects a table with the following columns:

  • id: A foreign key to the main table id.
  • <col_name>: The categorical or quantitative data attribute.
  • span_start: The start index of the segment in the text.
  • span_end: The end index of the segment in the text.

INFO

Even for list attributes like authors or keywords, the table is still required to have the span_start and span_end columns. The values for these columns can be set to the array indices.