
df_schema() gives a column-by-column description of a data frame. For each column, it gives the name, type, label (if present), and number of missing values. For numeric and date/time columns, it also gives the range. For character and factor columns, it also gives the number of unique values and, if there are only a few (<= 10), the values themselves.

The goal is to give the LLM a sense of the structure of the data so that it can generate useful code. The output tries to balance conciseness against accuracy.
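To make the per-column rules above concrete, here is a minimal base R sketch of how one column might be summarised. It is not the implementation behind df_schema(): the helper name `describe_column()` and its `max_values` argument are hypothetical, labels are omitted, and the wording of the result only approximates the real output.

```r
# Hypothetical helper (an assumption for illustration, not part of the package):
# summarise a single column following the rules described above.
describe_column <- function(x, name, max_values = 10) {
  n_na <- sum(is.na(x))
  type <- class(x)[1]

  # Columns with no observed values have no range or unique values to report.
  if (all(is.na(x))) {
    return(sprintf("%s: %s with %d NAs", name, type, n_na))
  }

  if (is.numeric(x) || inherits(x, c("Date", "POSIXct"))) {
    # Numeric and date/time columns: report the observed range.
    rng <- range(x, na.rm = TRUE)
    sprintf("%s: %s with range [%s, %s], and %d NAs",
            name, type, format(rng[1]), format(rng[2]), n_na)
  } else if (is.character(x) || is.factor(x)) {
    # Character and factor columns: count unique values and list them
    # only when there are few enough to keep the description short.
    vals <- unique(as.character(x[!is.na(x)]))
    out <- sprintf("%s: %s with %d NAs, and %d unique values",
                   name, type, n_na, length(vals))
    if (length(vals) <= max_values) {
      out <- paste0(out, " (", paste0('"', vals, '"', collapse = ", "), ")")
    }
    out
  } else {
    # Any other type: name, type, and missing-value count only.
    sprintf("%s: %s with %d NAs", name, type, n_na)
  }
}

# Apply the helper to every column of a data frame.
schema <- vapply(
  names(iris),
  function(nm) describe_column(iris[[nm]], nm),
  character(1)
)
writeLines(paste0("* ", schema))
```

Listing values only when there are at most `max_values` of them mirrors the trade-off described above: a handful of categories is cheap to include and helps the model write correct filters, while hundreds of distinct strings would inflate the prompt for little benefit.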

Usage

df_schema(df, max_cols = 50)

Arguments

df

A data frame to describe.

max_cols

Maximum number of columns to include. Defaults to 50 to avoid accidentally generating very large prompts.

Examples

df_schema(mtcars)
#> [1] │ A data frame with 32 rows and 11 columns:
#>      * mpg: numeric with range [10.4, 33.9], and 0 NAs
#>      * cyl: numeric with range [4, 8], and 0 NAs
#>      * disp: numeric with range [71.1, 472], and 0 NAs
#>      * hp: numeric with range [52, 335], and 0 NAs
#>      * drat: numeric with range [2.76, 4.93], and 0 NAs
#>      * wt: numeric with range [1.513, 5.424], and 0 NAs
#>      * qsec: numeric with range [14.5, 22.9], and 0 NAs
#>      * vs: numeric with range [0, 1], and 0 NAs
#>      * am: numeric with range [0, 1], and 0 NAs
#>      * gear: numeric with range [3, 5], and 0 NAs
#>      * carb: numeric with range [1, 8], and 0 NAs
df_schema(iris)
#> [1] │ A data frame with 150 rows and 5 columns:
#>      * Sepal.Length: numeric with range [4.3, 7.9], and 0 NAs
#>      * Sepal.Width: numeric with range [2, 4.4], and 0 NAs
#>      * Petal.Length: numeric with range [1, 6.9], and 0 NAs
#>      * Petal.Width: numeric with range [0.1, 2.5], and 0 NAs
#>      * Species: nominal with 0 NAs, and 3 permitted values ("setosa", "versicolor", "virginica")