Read Data and Generate the Schema

This section covers how to load data and how to use the statistics inferred from it in Draco.

Available functions

The main functions derive a schema from a pandas DataFrame or from a file. Each returns the schema as a dictionary, which you can encode as Answer Set Programming (ASP) facts using the generic dict_to_facts encoder.

draco.schema.schema_from_dataframe(df, parse_data_type=dtype_to_field_type)

Read schema information from the given Pandas dataframe.

Parameters:
  • df (DataFrame) – DataFrame to generate schema for.

  • parse_data_type – Function to parse data types.

Return type:

Schema

Returns:

A dictionary representing the schema.

draco.schema.schema_from_file(file_path, parse_data_type=dtype_to_field_type)

Read schema information from the given CSV or JSON file.

Parameters:
  • file_path (Path) – Path to CSV or JSON file.

  • parse_data_type – Function to parse data types.

Raises:

ValueError – If the file has an unsupported data type.

Return type:

Schema

Returns:

A dictionary representing the schema.
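
The documented behavior (CSV or JSON accepted, ValueError otherwise) suggests a dispatch on the file extension. The following is only a minimal sketch of that idea using the standard library. The helper name load_records is hypothetical, and this is not Draco's implementation, which reads the file into a DataFrame and computes statistics as well.

```python
import csv
import json
from pathlib import Path


def load_records(file_path: Path) -> list[dict]:
    """Hypothetical helper: load rows from a CSV or JSON file as dicts."""
    suffix = file_path.suffix.lower()
    if suffix == ".csv":
        # csv.DictReader yields one dict per row; all values stay strings.
        with file_path.open(newline="") as f:
            return list(csv.DictReader(f))
    if suffix == ".json":
        # Assumes the JSON file holds a list of row objects.
        with file_path.open() as f:
            return json.load(f)
    # Mirrors the documented contract: other file types are rejected.
    raise ValueError(f"Unsupported file type: {suffix}")
```

When calling the real schema_from_file, pass a pathlib.Path and be prepared to catch ValueError for files that are neither CSV nor JSON.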

Usage Example

from draco import dict_to_facts, schema_from_dataframe

In this example, we use a weather dataset from Vega datasets, but this could be any pandas DataFrame.

from vega_datasets import data

df = data.seattle_weather()

We can then call schema_from_dataframe to get schema information from the DataFrame. The schema is returned as a dictionary.

schema = schema_from_dataframe(df)
schema
{'number_rows': 1461,
 'field': [{'name': 'date',
   'type': 'datetime',
   'unique': 1461,
   'entropy': 7287},
  {'name': 'precipitation',
   'type': 'number',
   'unique': 111,
   'entropy': 2422,
   'min': 0,
   'max': 55,
   'std': 6},
  {'name': 'temp_max',
   'type': 'number',
   'unique': 67,
   'entropy': 3934,
   'min': -1,
   'max': 35,
   'std': 7},
  {'name': 'temp_min',
   'type': 'number',
   'unique': 55,
   'entropy': 3596,
   'min': -7,
   'max': 18,
   'std': 5},
  {'name': 'wind',
   'type': 'number',
   'unique': 79,
   'entropy': 3950,
   'min': 0,
   'max': 9,
   'std': 1},
  {'name': 'weather',
   'type': 'string',
   'unique': 5,
   'entropy': 1201,
   'freq': np.int64(714)}]}
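
The statistics in this output are integers. The entropy values look like Shannon entropy in natural-log units, multiplied by 1000 and rounded: the date column has 1461 distinct values in 1461 rows, and ln(1461) ≈ 7.287. This is an inference from the output above, not a statement about Draco's source code. A sketch under that assumption, with the hypothetical helper name scaled_entropy:

```python
import math
from collections import Counter


def scaled_entropy(values) -> int:
    """Shannon entropy (natural log) of the value distribution, times 1000.

    Assumption: this scaling and rounding are inferred from the schema
    output shown above; Draco's actual computation may differ.
    """
    counts = Counter(values)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return round(entropy * 1000)


# 1461 distinct values, one per row, like the 'date' column above.
print(scaled_entropy(range(1461)))  # prints 7287
```

A plausible reason for integer scaling is that ASP solvers work with integers rather than floating-point numbers.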

We can then use dict_to_facts to convert the schema dictionary into facts for Draco's constraint solver. The function returns a list of facts, which the solver can parse and take into account during the recommendation process.

dict_to_facts(schema)
['attribute(number_rows,root,1461).',
 'entity(field,root,0).',
 'attribute((field,name),0,date).',
 'attribute((field,type),0,datetime).',
 'attribute((field,unique),0,1461).',
 'attribute((field,entropy),0,7287).',
 'entity(field,root,1).',
 'attribute((field,name),1,precipitation).',
 'attribute((field,type),1,number).',
 'attribute((field,unique),1,111).',
 'attribute((field,entropy),1,2422).',
 'attribute((field,min),1,0).',
 'attribute((field,max),1,55).',
 'attribute((field,std),1,6).',
 'entity(field,root,2).',
 'attribute((field,name),2,temp_max).',
 'attribute((field,type),2,number).',
 'attribute((field,unique),2,67).',
 'attribute((field,entropy),2,3934).',
 'attribute((field,min),2,-1).',
 'attribute((field,max),2,35).',
 'attribute((field,std),2,7).',
 'entity(field,root,3).',
 'attribute((field,name),3,temp_min).',
 'attribute((field,type),3,number).',
 'attribute((field,unique),3,55).',
 'attribute((field,entropy),3,3596).',
 'attribute((field,min),3,-7).',
 'attribute((field,max),3,18).',
 'attribute((field,std),3,5).',
 'entity(field,root,4).',
 'attribute((field,name),4,wind).',
 'attribute((field,type),4,number).',
 'attribute((field,unique),4,79).',
 'attribute((field,entropy),4,3950).',
 'attribute((field,min),4,0).',
 'attribute((field,max),4,9).',
 'attribute((field,std),4,1).',
 'entity(field,root,5).',
 'attribute((field,name),5,weather).',
 'attribute((field,type),5,string).',
 'attribute((field,unique),5,5).',
 'attribute((field,entropy),5,1201).',
 'attribute((field,freq),5,714).']
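
The shape of these facts can be reproduced with a small recursive encoder: scalar values become attribute facts, and each dictionary in a list becomes an entity with its own numeric id. The function sketch_dict_to_facts below is a hypothetical illustration written to match the output above. It is not Draco's dict_to_facts and makes no attempt to cover every case the real encoder handles.

```python
from itertools import count


def sketch_dict_to_facts(data, parent="root", entity_type=None, ids=None):
    """Illustrative encoder for nested dicts, modeled on the output above."""
    ids = count() if ids is None else ids  # shared counter for entity ids
    facts = []
    for key, value in data.items():
        if isinstance(value, list):
            # Each list item is an entity attached to the current parent.
            for item in value:
                entity_id = next(ids)
                facts.append(f"entity({key},{parent},{entity_id}).")
                facts.extend(sketch_dict_to_facts(item, entity_id, key, ids))
        else:
            # Attributes of an entity are qualified with the entity type.
            name = key if entity_type is None else f"({entity_type},{key})"
            facts.append(f"attribute({name},{parent},{value}).")
    return facts


small_schema = {
    "number_rows": 1461,
    "field": [{"name": "weather", "type": "string", "unique": 5}],
}
facts = sketch_dict_to_facts(small_schema)
```

Here facts begins with 'attribute(number_rows,root,1461).' followed by 'entity(field,root,0).', the same pattern as in the output above.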