Read Data and Generate the Schema
This section covers how to load data and generate a schema of inferred statistics for use in Draco.
Available functions
The main functions allow you to get the schema from a Pandas dataframe or a file. These functions return a schema as a dictionary, which you can encode as Answer Set Programming facts using our generic dict_to_facts encoder.
- draco.schema.schema_from_dataframe(df, parse_data_type=dtype_to_field_type)
Read schema information from the given Pandas dataframe.
- Parameters:
  - df (DataFrame) – DataFrame to generate schema for.
  - parse_data_type – Function to parse data types.
- Return type:
Schema
- Returns:
A dictionary representing the schema.
- draco.schema.schema_from_file(file_path, parse_data_type=dtype_to_field_type)
Read schema information from the given CSV or JSON file.
- Parameters:
  - file_path (Path) – Path to CSV or JSON file.
  - parse_data_type – Function to parse data types.
- Raises:
ValueError – If the file has an unsupported data type.
- Return type:
Schema
- Returns:
A dictionary representing the schema.
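The file-based entry point accepts only CSV or JSON and raises ValueError otherwise. The sketch below illustrates that contract with the standard library alone. It is not Draco's implementation: load_records is a hypothetical helper, and dispatching on the file suffix is an assumption about how such a check could be written.

```python
import csv
import json
from pathlib import Path


def load_records(file_path: Path) -> list:
    """Hypothetical helper: load rows from a CSV or JSON file, reject anything else."""
    suffix = file_path.suffix.lower()
    if suffix == ".csv":
        # Each CSV row becomes a dict keyed by the header names (values stay strings).
        with file_path.open(newline="") as f:
            return list(csv.DictReader(f))
    if suffix == ".json":
        # Assumes the JSON file holds a list of records.
        with file_path.open() as f:
            return json.load(f)
    # Mirrors the documented behavior: unsupported inputs raise ValueError.
    raise ValueError(f"Unsupported file type: {suffix}")
```

Because the suffix is checked before the file is opened, a caller can catch ValueError to report an unsupported input early.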
Usage Example
from draco import dict_to_facts, schema_from_dataframe
In this example, we use a weather dataset from Vega datasets, but this could be any Pandas dataframe.
from vega_datasets import data
df = data.seattle_weather()
We can then call schema_from_dataframe to get schema information from the Pandas dataframe. The schema information is a dictionary.
schema = schema_from_dataframe(df)
schema
{'number_rows': 1461,
'field': [{'name': 'date',
'type': 'datetime',
'unique': 1461,
'entropy': 7287},
{'name': 'precipitation',
'type': 'number',
'unique': 111,
'entropy': 2422,
'min': 0,
'max': 55,
'std': 6},
{'name': 'temp_max',
'type': 'number',
'unique': 67,
'entropy': 3934,
'min': -1,
'max': 35,
'std': 7},
{'name': 'temp_min',
'type': 'number',
'unique': 55,
'entropy': 3596,
'min': -7,
'max': 18,
'std': 5},
{'name': 'wind',
'type': 'number',
'unique': 79,
'entropy': 3950,
'min': 0,
'max': 9,
'std': 1},
{'name': 'weather',
'type': 'string',
'unique': 5,
'entropy': 1201,
'freq': np.int64(714)}]}
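The entropy values in this output are integers. For the date field, where all 1461 values are distinct, the reported 7287 is consistent with Shannon entropy in nats (ln 1461 ≈ 7.287) multiplied by 1000 and rounded. This is an inference from the printed numbers rather than a statement about how Draco computes the statistic. A minimal sketch under that assumption, with scaled_entropy as a hypothetical helper:

```python
import math
from collections import Counter


def scaled_entropy(values, scale=1000):
    """Hypothetical helper: natural-log Shannon entropy, scaled and rounded to an integer."""
    counts = Counter(values)
    total = sum(counts.values())
    # H = -sum(p * ln(p)) over the relative frequency p of each distinct value.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return round(entropy * scale)


# A column of 1461 distinct values, like the date field above.
print(scaled_entropy(range(1461)))  # → 7287
```

A column with a single repeated value has entropy 0 under this definition, and skewed columns such as weather fall below the uniform maximum of ln(unique).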
We can then use dict_to_facts to convert the schema dictionary into facts that Draco's constraint solver can use. The function returns a list of facts. The solver will be able to parse these facts and consider them in the recommendation process.
dict_to_facts(schema)
['attribute(number_rows,root,1461).',
'entity(field,root,0).',
'attribute((field,name),0,date).',
'attribute((field,type),0,datetime).',
'attribute((field,unique),0,1461).',
'attribute((field,entropy),0,7287).',
'entity(field,root,1).',
'attribute((field,name),1,precipitation).',
'attribute((field,type),1,number).',
'attribute((field,unique),1,111).',
'attribute((field,entropy),1,2422).',
'attribute((field,min),1,0).',
'attribute((field,max),1,55).',
'attribute((field,std),1,6).',
'entity(field,root,2).',
'attribute((field,name),2,temp_max).',
'attribute((field,type),2,number).',
'attribute((field,unique),2,67).',
'attribute((field,entropy),2,3934).',
'attribute((field,min),2,-1).',
'attribute((field,max),2,35).',
'attribute((field,std),2,7).',
'entity(field,root,3).',
'attribute((field,name),3,temp_min).',
'attribute((field,type),3,number).',
'attribute((field,unique),3,55).',
'attribute((field,entropy),3,3596).',
'attribute((field,min),3,-7).',
'attribute((field,max),3,18).',
'attribute((field,std),3,5).',
'entity(field,root,4).',
'attribute((field,name),4,wind).',
'attribute((field,type),4,number).',
'attribute((field,unique),4,79).',
'attribute((field,entropy),4,3950).',
'attribute((field,min),4,0).',
'attribute((field,max),4,9).',
'attribute((field,std),4,1).',
'entity(field,root,5).',
'attribute((field,name),5,weather).',
'attribute((field,type),5,string).',
'attribute((field,unique),5,5).',
'attribute((field,entropy),5,1201).',
'attribute((field,freq),5,714).']
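The mapping from dictionary to facts follows a regular pattern: top-level scalars become attribute facts on root, each item in a list becomes an entity fact with a numeric ID, and the item's own keys become attribute facts on that ID. The sketch below reproduces this pattern for the simple shape used here. It is a simplified illustration, not Draco's dict_to_facts; sketch_dict_to_facts is a hypothetical name, and the real encoder may handle cases this one does not.

```python
from itertools import count


def sketch_dict_to_facts(data, parent="root", path=(), ids=None):
    """Hypothetical, simplified encoder for dicts of scalars and lists of dicts."""
    ids = count() if ids is None else ids
    facts = []
    for key, value in data.items():
        if isinstance(value, list):
            # Each list item is an entity of type `key`, attached to its parent.
            for item in value:
                entity_id = next(ids)
                facts.append(f"entity({key},{parent},{entity_id}).")
                facts.extend(sketch_dict_to_facts(item, entity_id, (key,), ids))
        else:
            # Nested attributes are named by a tuple such as (field,name).
            name = key if not path else f"({','.join(path + (key,))})"
            facts.append(f"attribute({name},{parent},{value}).")
    return facts


small_schema = {
    "number_rows": 3,
    "field": [
        {"name": "date", "type": "datetime"},
        {"name": "wind", "type": "number", "min": 0},
    ],
}
for fact in sketch_dict_to_facts(small_schema):
    print(fact)
```

Running it on the small schema prints facts in the same entity/attribute layout as the output above, starting with attribute(number_rows,root,3).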