lux.core.frame.LuxDataFrame

class lux.core.frame.LuxDataFrame(*args, **kw)[source]

A subclass of pd.DataFrame that supports all DataFrame operations while additionally maintaining the state and methods needed to generate visual recommendations.

__init__(*args, **kw)[source]

Initialize self. See help(type(self)) for accurate signature.
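Since the page gives no usage snippet, here is a minimal sketch of how a LuxDataFrame typically arises: importing lux patches pandas so that ordinary DataFrame construction yields a LuxDataFrame, and all pandas operations continue to work on it. The import is guarded so the snippet also runs where lux is not installed; the column names are illustrative, not from the original.

```python
import pandas as pd

# Importing lux patches pandas so newly created frames are LuxDataFrame
# instances; guarded so this sketch also runs without lux installed.
try:
    import lux  # noqa: F401
    HAVE_LUX = True
except ImportError:
    HAVE_LUX = False

# All ordinary pandas operations still work on the (Lux)DataFrame.
df = pd.DataFrame({"weight": [150, 200, 250], "mpg": [29, 24, 18]})
subset = df[df["mpg"] > 20]  # plain pandas filtering
print(type(df).__name__, subset.shape)
```

In a Jupyter notebook with lux installed, simply displaying `df` shows a toggle between the usual pandas table and the recommended visualizations.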

Methods

__init__(*args, **kw) Initialize self.
abs() Return a Series/DataFrame with absolute numeric value of each element.
add(other[, axis, level, fill_value]) Get Addition of dataframe and other, element-wise (binary operator add).
add_prefix(prefix) Prefix labels with string prefix.
add_suffix(suffix) Suffix labels with string suffix.
agg([func, axis]) Aggregate using one or more operations over the specified axis.
aggregate([func, axis]) Aggregate using one or more operations over the specified axis.
align(other[, join, axis, level, copy, …]) Align two objects on their axes with the specified join method.
all([axis, bool_only, skipna, level]) Return whether all elements are True, potentially over an axis.
any([axis, bool_only, skipna, level]) Return whether any element is True, potentially over an axis.
append(other[, ignore_index, …]) Append rows of other to the end of caller, returning a new object.
apply(func[, axis, raw, result_type, args]) Apply a function along an axis of the DataFrame.
applymap(func[, na_action]) Apply a function to a Dataframe elementwise.
asfreq(freq[, method, fill_value]) Convert TimeSeries to specified frequency.
asof(where[, subset]) Return the last row(s) without any NaNs before where.
assign(**kwargs) Assign new columns to a DataFrame.
astype(dtype[, copy, errors]) Cast a pandas object to a specified dtype dtype.
at_time(time[, asof, axis]) Select values at particular time of day (e.g., 9:30AM).
backfill([axis, limit, downcast]) Synonym for DataFrame.fillna() with method='bfill'.
between_time(start_time, end_time, …[, axis]) Select values between particular times of the day (e.g., 9:00-9:30 AM).
bfill([axis, limit, downcast]) Synonym for DataFrame.fillna() with method='bfill'.
bool() Return the bool of a single element Series or DataFrame.
boxplot([column, by, ax, fontsize, rot, …]) Make a box plot from DataFrame columns.
clear_intent()
clip([lower, upper, axis]) Trim values at input threshold(s).
combine(other, func[, fill_value, overwrite]) Perform column-wise combine with another DataFrame.
combine_first(other) Update null elements with value in the same location in other.
compare(other[, align_axis, keep_shape, …]) Compare to another DataFrame and show the differences.
compute_SQL_data_type()
compute_SQL_dataset_metadata()
compute_SQL_stats()
convert_dtypes([infer_objects, …]) Convert columns to best possible dtypes using dtypes supporting pd.NA.
copy(deep) Make a copy of this object’s indices and data.
copy_intent()
corr([method, min_periods]) Compute pairwise correlation of columns, excluding NA/null values.
corrwith(other[, axis, drop, method]) Compute pairwise correlation.
count([axis, level, numeric_only]) Count non-NA cells for each column or row.
cov(min_periods, ddof) Compute pairwise covariance of columns, excluding NA/null values.
cummax([axis, skipna]) Return cumulative maximum over a DataFrame or Series axis.
cummin([axis, skipna]) Return cumulative minimum over a DataFrame or Series axis.
cumprod([axis, skipna]) Return cumulative product over a DataFrame or Series axis.
cumsum([axis, skipna]) Return cumulative sum over a DataFrame or Series axis.
current_vis_to_JSON(vlist[, input_current_vis])
describe(*args, **kwargs) Generate descriptive statistics.
diff([periods, axis]) First discrete difference of element.
display_pandas()
div(other[, axis, level, fill_value]) Get Floating division of dataframe and other, element-wise (binary operator truediv).
divide(other[, axis, level, fill_value]) Get Floating division of dataframe and other, element-wise (binary operator truediv).
dot(other) Compute the matrix multiplication between the DataFrame and other.
drop([labels, axis, index, columns, level, …]) Drop specified labels from rows or columns.
drop_duplicates([subset, keep, inplace, …]) Return DataFrame with duplicate rows removed.
droplevel(level[, axis]) Return DataFrame with requested index / column level(s) removed.
dropna([axis, how, thresh, subset, inplace]) Remove missing values.
duplicated([subset, keep]) Return boolean Series denoting duplicate rows.
eq(other[, axis, level]) Get Equal to of dataframe and other, element-wise (binary operator eq).
equals(other) Test whether two objects contain the same elements.
eval(expr[, inplace]) Evaluate a string describing operations on DataFrame columns.
ewm([com, span, halflife, alpha, …]) Provide exponential weighted (EW) functions.
expanding([min_periods, center, axis]) Provide expanding transformations.
expire_metadata() Expire all saved metadata to trigger a recomputation the next time the data is required.
expire_recs() Expire and reset all recommendations.
explode(column[, ignore_index]) Transform each element of a list-like to a row, replicating index values.
ffill([axis, limit, downcast]) Synonym for DataFrame.fillna() with method='ffill'.
fillna([value, method, axis, inplace, …]) Fill NA/NaN values using the specified method.
filter([items, axis]) Subset the dataframe rows or columns according to the specified index labels.
first(offset) Select initial periods of time series data based on a date offset.
first_valid_index() Return index for first non-NA/null value.
floordiv(other[, axis, level, fill_value]) Get Integer division of dataframe and other, element-wise (binary operator floordiv).
from_dict(data[, orient, dtype, columns]) Construct DataFrame from dict of array-like or dicts.
from_records(data[, index, exclude, …]) Convert structured or record ndarray to DataFrame.
ge(other[, axis, level]) Get Greater than or equal to of dataframe and other, element-wise (binary operator ge).
get(key[, default]) Get item from object for given key (ex: DataFrame column).
get_SQL_attributes()
get_SQL_cardinality()
get_SQL_unique_values()
groupby(*args, **kwargs) Group DataFrame using a mapper or by a Series of columns.
gt(other[, axis, level]) Get Greater than of dataframe and other, element-wise (binary operator gt).
head(n) Return the first n rows.
hist([column, by, ax, …]) Make a histogram of the DataFrame's columns.
idxmax([axis, skipna]) Return index of first occurrence of maximum over requested axis.
idxmin([axis, skipna]) Return index of first occurrence of minimum over requested axis.
infer_objects() Attempt to infer better dtypes for object columns.
info(*args, **kwargs) Print a concise summary of a DataFrame.
insert(loc, column, value[, allow_duplicates]) Insert column into DataFrame at specified location.
intent_to_JSON(intent)
intent_to_string(intent)
interpolate([method, axis, limit, …]) Fill NaN values using an interpolation method.
isin(values) Whether each element in the DataFrame is contained in values.
isna() Detect missing values.
isnull() Detect missing values.
items() Iterate over (column name, Series) pairs.
iteritems() Iterate over (column name, Series) pairs.
iterrows() Iterate over DataFrame rows as (index, Series) pairs.
itertuples(index, name) Iterate over DataFrame rows as namedtuples.
join(other[, on, how, lsuffix, rsuffix, sort]) Join columns of another DataFrame.
keys() Get the ‘info axis’ (see Indexing for more).
kurt([axis, skipna, level, numeric_only]) Return unbiased kurtosis over requested axis.
kurtosis([axis, skipna, level, numeric_only]) Return unbiased kurtosis over requested axis.
last(offset) Select final periods of time series data based on a date offset.
last_valid_index() Return index for last non-NA/null value.
le(other[, axis, level]) Get Less than or equal to of dataframe and other, element-wise (binary operator le).
lookup(row_labels, col_labels) Label-based “fancy indexing” function for DataFrame.
lt(other[, axis, level]) Get Less than of dataframe and other, element-wise (binary operator lt).
mad([axis, skipna, level]) Return the mean absolute deviation of the values over the requested axis.
maintain_metadata()
maintain_recs()
mask(cond[, other, inplace, axis, level, …]) Replace values where the condition is True.
max([axis, skipna, level, numeric_only]) Return the maximum of the values over the requested axis.
mean([axis, skipna, level, numeric_only]) Return the mean of the values over the requested axis.
median([axis, skipna, level, numeric_only]) Return the median of the values over the requested axis.
melt([id_vars, value_vars, var_name, …]) Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
memory_usage([index, deep]) Return the memory usage of each column in bytes.
merge(right[, how, on, left_on, right_on, …]) Merge DataFrame or named Series objects with a database-style join.
min([axis, skipna, level, numeric_only]) Return the minimum of the values over the requested axis.
mod(other[, axis, level, fill_value]) Get Modulo of dataframe and other, element-wise (binary operator mod).
mode([axis, numeric_only, dropna]) Get the mode(s) of each element along the selected axis.
mul(other[, axis, level, fill_value]) Get Multiplication of dataframe and other, element-wise (binary operator mul).
multiply(other[, axis, level, fill_value]) Get Multiplication of dataframe and other, element-wise (binary operator mul).
ne(other[, axis, level]) Get Not equal to of dataframe and other, element-wise (binary operator ne).
nlargest(n, columns[, keep]) Return the first n rows ordered by columns in descending order.
notna() Detect existing (non-missing) values.
notnull() Detect existing (non-missing) values.
nsmallest(n, columns[, keep]) Return the first n rows ordered by columns in ascending order.
nunique([axis, dropna]) Count distinct observations over requested axis.
pad([axis, limit, downcast]) Synonym for DataFrame.fillna() with method='ffill'.
pct_change([periods, fill_method, limit, freq]) Percentage change between the current and a prior element.
pipe(func, *args, **kwargs) Apply func(self, *args, **kwargs).
pivot([index, columns, values]) Return reshaped DataFrame organized by given index / column values.
pivot_table([values, index, columns, …]) Create a spreadsheet-style pivot table as a DataFrame.
pop(item) Return item and drop from frame.
pow(other[, axis, level, fill_value]) Get Exponential power of dataframe and other, element-wise (binary operator pow).
prod([axis, skipna, level, numeric_only, …]) Return the product of the values over the requested axis.
product([axis, skipna, level, numeric_only, …]) Return the product of the values over the requested axis.
quantile([q, axis, numeric_only, interpolation]) Return values at the given quantile over requested axis.
query(expr[, inplace]) Query the columns of a DataFrame with a boolean expression.
radd(other[, axis, level, fill_value]) Get Addition of dataframe and other, element-wise (binary operator radd).
rank([axis]) Compute numerical data ranks (1 through n) along axis.
rdiv(other[, axis, level, fill_value]) Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
rec_to_JSON(recs)
reindex([labels, index, columns, axis, …]) Conform Series/DataFrame to new index with optional filling logic.
reindex_like(other[, method, copy, limit, …]) Return an object with matching indices as other object.
remove_deleted_recs(change)
rename([mapper, index, columns, axis, copy, …]) Alter axes labels.
rename_axis([mapper, index, columns, axis, …]) Set the name of the axis for the index or columns.
render_widget(renderer[, input_current_vis]) Generate a LuxWidget based on the LuxDataFrame
reorder_levels(order[, axis]) Rearrange index levels using input order.
replace([to_replace, value, inplace, limit, …]) Replace values given in to_replace with value.
resample(rule[, axis, loffset, on, level]) Resample time-series data.
reset_index([level, drop, inplace, …]) Reset the index, or a level of it.
rfloordiv(other[, axis, level, fill_value]) Get Integer division of dataframe and other, element-wise (binary operator rfloordiv).
rmod(other[, axis, level, fill_value]) Get Modulo of dataframe and other, element-wise (binary operator rmod).
rmul(other[, axis, level, fill_value]) Get Multiplication of dataframe and other, element-wise (binary operator rmul).
rolling(window[, min_periods, center, …]) Provide rolling window calculations.
round([decimals]) Round a DataFrame to a variable number of decimal places.
rpow(other[, axis, level, fill_value]) Get Exponential power of dataframe and other, element-wise (binary operator rpow).
rsub(other[, axis, level, fill_value]) Get Subtraction of dataframe and other, element-wise (binary operator rsub).
rtruediv(other[, axis, level, fill_value]) Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
sample([n, frac, replace, weights, …]) Return a random sample of items from an axis of object.
save_as_html(filename) Save dataframe widget as static HTML file
select_dtypes([include, exclude]) Return a subset of the DataFrame’s columns based on the column dtypes.
sem([axis, skipna, level, ddof, numeric_only]) Return unbiased standard error of the mean over requested axis.
set_SQL_table(t_name)
set_axis(labels[, axis, inplace]) Assign desired index to given axis.
set_data_type(types) Set the data type for a particular attribute in the dataframe, overriding the automatically detected type inferred by Lux.
set_index(keys[, drop, append, inplace, …]) Set the DataFrame index using existing columns.
set_intent(intent)
set_intent_as_vis(vis) Set intent of the dataframe based on the intent of a Vis
set_intent_on_click(change)
shift([periods, freq, axis, fill_value]) Shift index by desired number of periods with an optional time freq.
skew([axis, skipna, level, numeric_only]) Return unbiased skew over requested axis.
slice_shift(periods[, axis]) Equivalent to shift without copying data.
sort_index([axis, level]) Sort object by labels (along an axis).
sort_values(by[, axis, ascending, inplace, …]) Sort by the values along either axis.
squeeze([axis]) Squeeze 1 dimensional axis objects into scalars.
stack([level, dropna]) Stack the prescribed level(s) from columns to index.
std([axis, skipna, level, ddof, numeric_only]) Return sample standard deviation over requested axis.
sub(other[, axis, level, fill_value]) Get Subtraction of dataframe and other, element-wise (binary operator sub).
subtract(other[, axis, level, fill_value]) Get Subtraction of dataframe and other, element-wise (binary operator sub).
sum([axis, skipna, level, numeric_only, …]) Return the sum of the values over the requested axis.
swapaxes(axis1, axis2[, copy]) Interchange axes and swap values appropriately.
swaplevel([i, j, axis]) Swap levels i and j in a MultiIndex on a particular axis.
tail(n) Return the last n rows.
take(indices[, axis]) Return the elements in the given positional indices along an axis.
to_JSON(rec_infolist[, input_current_vis])
to_clipboard([excel, sep], **kwargs) Copy object to the system clipboard.
to_csv([path_or_buf, sep, na_rep, …]) Write object to a comma-separated values (csv) file.
to_dict([orient, into]) Convert the DataFrame to a dictionary.
to_excel(excel_writer[, sheet_name, na_rep, …]) Write object to an Excel sheet.
to_feather(path, **kwargs) Write a DataFrame to the binary Feather format.
to_gbq(destination_table[, project_id, …]) Write a DataFrame to a Google BigQuery table.
to_hdf(path_or_buf, key, mode, complevel, …) Write the contained data to an HDF5 file using HDFStore.
to_html([buf, columns, col_space, header, …]) Render a DataFrame as an HTML table.
to_json([path_or_buf, orient, …]) Convert the object to a JSON string.
to_latex([buf, columns, col_space, header, …]) Render object to a LaTeX tabular, longtable, or nested table/tabular.
to_markdown([buf, mode, …]) Print DataFrame in Markdown-friendly format.
to_numpy([dtype, na_value]) Convert the DataFrame to a NumPy array.
to_pandas()
to_parquet([path, engine, compression, …]) Write a DataFrame to the binary parquet format.
to_period([freq]) Convert DataFrame from DatetimeIndex to PeriodIndex.
to_pickle(path[, compression, protocol, …]) Pickle (serialize) object to file.
to_records([index, column_dtypes, index_dtypes]) Convert DataFrame to a NumPy record array.
to_sql(name, con[, schema, index_label, …]) Write records stored in a DataFrame to a SQL database.
to_stata(path[, convert_dates, write_index, …]) Export DataFrame object to Stata dta format.
to_string([buf, columns, col_space, …]) Render a DataFrame to a console-friendly tabular output.
to_timestamp([freq]) Cast to DatetimeIndex of timestamps, at beginning of period.
to_xarray() Return an xarray object from the pandas object.
transform(func[, axis]) Call func on self producing a DataFrame with transformed values.
transpose(*args, copy) Transpose index and columns.
truediv(other[, axis, level, fill_value]) Get Floating division of dataframe and other, element-wise (binary operator truediv).
truncate([before, after, axis]) Truncate a Series or DataFrame before and after some index value.
tshift(periods[, freq]) Shift the time index, using the index’s frequency if available.
tz_convert(tz[, axis, level]) Convert tz-aware axis to target time zone.
tz_localize(tz[, axis, level, ambiguous]) Localize tz-naive index of a Series or DataFrame to target time zone.
unstack([level, fill_value]) Pivot a level of the (necessarily hierarchical) index labels.
update(other[, join, overwrite, …]) Modify in place using non-NA values from another DataFrame.
value_counts([subset, normalize, sort, ascending]) Return a Series containing counts of unique rows in the DataFrame.
var([axis, skipna, level, ddof, numeric_only]) Return unbiased variance over requested axis.
where(cond[, other, inplace, axis, level, …]) Replace values where the condition is False.
xs(key[, axis, level]) Return cross-section from the Series/DataFrame.

Attributes

T
at Access a single value for a row/column label pair.
attrs Dictionary of global attributes of this dataset.
axes Return a list representing the axes of the DataFrame.
columns The column labels of the DataFrame.
current_vis
data_type
dtypes Return the dtypes in the DataFrame.
empty Indicator whether DataFrame is empty.
exported Get selected visualizations as exported Vis List
history
iat Access a single value for a row/column pair by integer position.
iloc Purely integer-location based indexing for selection by position.
index The index (row labels) of the DataFrame.
intent Main function to set the intent of the dataframe.
loc Access a group of rows and columns by label(s) or a boolean array.
ndim Return an int representing the number of axes / array dimensions.
recommendation
shape Return a tuple representing the dimensionality of the DataFrame.
size Return an int representing the number of elements in this object.
style Returns a Styler object.
values Return a Numpy representation of the DataFrame.
widget
describe(*args, **kwargs)[source]

Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters:
  • percentiles (list-like of numbers, optional) – The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
  • include ('all', list-like of dtypes or None (default), optional) –

    A white list of data types to include in the result. Ignored for Series. Here are the options:

    • ’all’ : All columns of the input will be included in the output.
    • A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.
    • None (default) : The result will include all numeric columns.
  • exclude (list-like of dtypes or None (default), optional,) –

    A black list of data types to omit from the result. Ignored for Series. Here are the options:

    • A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.
    • None (default) : The result will exclude nothing.
  • datetime_is_numeric (bool, default False) –

    Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DataFrame input, this also controls whether datetime columns are included by default.

    New in version 1.1.0.

Returns:

Summary statistics of the Series or DataFrame provided.

Return type:

Series or DataFrame

See also

DataFrame.count()
Count number of non-NA/null observations.
DataFrame.max()
Maximum of the values in the object.
DataFrame.min()
Minimum of the values in the object.
DataFrame.mean()
Mean of the values.
DataFrame.std()
Standard deviation of the observations.
DataFrame.select_dtypes()
Subset of a DataFrame including/excluding columns based on their dtype.

Notes

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

Examples

Describing a numeric Series.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')  # doctest: +SKIP
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])  # doctest: +SKIP
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])  # doctest: +SKIP
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])  # doctest: +SKIP
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
expire_metadata()[source]

Expire all saved metadata to trigger a recomputation the next time the data is required.

expire_recs()[source]

Expire and reset all recommendations.

groupby(*args, **kwargs)[source]

Group DataFrame using a mapper or by a Series of columns.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Parameters:
  • by (mapping, function, label, or list of labels) – Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Split along rows (0) or columns (1).
  • level (int, level name, or sequence of such, default None) – If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
  • as_index (bool, default True) – For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
  • sort (bool, default True) – Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
  • group_keys (bool, default True) – When calling apply, add group keys to index to identify pieces.
  • squeeze (bool, default False) –

    Reduce the dimensionality of the return type if possible, otherwise return a consistent type.

    Deprecated since version 1.1.0.

  • observed (bool, default False) – This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
  • dropna (bool, default True) –

    If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.

    New in version 1.1.0.

Returns:

Returns a groupby object that contains information about the groups.

Return type:

DataFrameGroupBy

See also

resample()
Convenience method for frequency conversion and resampling of time series.

Notes

See the user guide for more.

Examples

>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0

Hierarchical Indexes

We can groupby different levels of a hierarchical index using the level parameter:

>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
>>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
...                   index=index)
>>> df
                Max Speed
Animal Type
Falcon Captive      390.0
       Wild         350.0
Parrot Captive       30.0
       Wild          20.0
>>> df.groupby(level=0).mean()
        Max Speed
Animal
Falcon      370.0
Parrot       25.0
>>> df.groupby(level="Type").mean()
         Max Speed
Type
Captive      210.0
Wild         185.0

We can also choose whether to include NA in the group keys by setting the dropna parameter; the default setting is True:

>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum()
    a   c
b
1.0 2   3
2.0 2   5
>>> df.groupby(by=["b"], dropna=False).sum()
    a   c
b
1.0 2   3
2.0 2   5
NaN 1   4
>>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by="a").sum()
    b     c
a
a   13.0   13.0
b   12.3  123.0
>>> df.groupby(by="a", dropna=False).sum()
    b     c
a
a   13.0   13.0
b   12.3  123.0
NaN 12.3   33.0
head(n: int = 5)[source]

Return the first n rows.

This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].

Parameters:n (int, default 5) – Number of rows to select.
Returns:The first n rows of the caller object.
Return type:same type as caller

See also

DataFrame.tail()
Returns the last n rows.

Examples

>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
8      zebra

Viewing the first 5 lines

>>> df.head()
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey

Viewing the first n lines (three in this case)

>>> df.head(3)
      animal
0  alligator
1        bee
2     falcon

For negative values of n

>>> df.head(-3)
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
info(*args, **kwargs)[source]

Print a concise summary of a DataFrame.

This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

Parameters:
  • data (DataFrame) – DataFrame to print information about.
  • verbose (bool, optional) – Whether to print the full summary. By default, the setting in pandas.options.display.max_info_columns is followed.
  • buf (writable buffer, defaults to sys.stdout) – Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.
  • max_cols (int, optional) – When to switch from the verbose to the truncated output. If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columns is used.
  • memory_usage (bool, str, optional) –

    Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed. By default, this follows the pandas.options.display.memory_usage setting.

    True always shows memory usage; False never shows it. A value of 'deep' is equivalent to "True with deep introspection". Memory usage is shown in human-readable units (base-2 representation). Without deep introspection, a memory estimation is made based on column dtype and number of rows, assuming values consume the same amount of memory for corresponding dtypes. With deep memory introspection, an actual memory usage calculation is performed at the cost of computational resources.

  • show_counts (bool, optional) – Whether to show the non-null counts. By default, this is shown only if the DataFrame is smaller than pandas.options.display.max_info_rows and pandas.options.display.max_info_columns. A value of True always shows the counts, and False never shows the counts.
  • null_counts (bool, optional) –

    Deprecated since version 1.2.0: Use show_counts instead.

Returns:

This method prints a summary of a DataFrame and returns None.

Return type:

None

See also

DataFrame.describe()
Generate descriptive statistics of DataFrame columns.
DataFrame.memory_usage()
Memory usage of DataFrame columns.

Examples

>>> int_values = [1, 2, 3, 4, 5]
>>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
>>> float_values = [0.0, 0.25, 0.5, 0.75, 1.0]
>>> df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
...                   "float_col": float_values})
>>> df
    int_col text_col  float_col
0        1    alpha       0.00
1        2     beta       0.25
2        3    gamma       0.50
3        4    delta       0.75
4        5  epsilon       1.00

Prints information of all columns:

>>> df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   int_col    5 non-null      int64
 1   text_col   5 non-null      object
 2   float_col  5 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes

Prints a summary of the column count and dtypes, but no per-column information:

>>> df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Columns: 3 entries, int_col to float_col
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes

Pipe the output of DataFrame.info to a buffer instead of sys.stdout, get the buffer content, and write it to a text file:

>>> import io
>>> buffer = io.StringIO()
>>> df.info(buf=buffer)
>>> s = buffer.getvalue()
>>> with open("df_info.txt", "w",
...           encoding="utf-8") as f:  # doctest: +SKIP
...     f.write(s)
260

The memory_usage parameter allows deep introspection mode, especially useful for big DataFrames and for fine-tuning memory optimization:

>>> random_strings_array = np.random.choice(['a', 'b', 'c'], 10 ** 6)
>>> df = pd.DataFrame({
...     'column_1': np.random.choice(['a', 'b', 'c'], 10 ** 6),
...     'column_2': np.random.choice(['a', 'b', 'c'], 10 ** 6),
...     'column_3': np.random.choice(['a', 'b', 'c'], 10 ** 6)
... })
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   column_1  1000000 non-null  object
 1   column_2  1000000 non-null  object
 2   column_3  1000000 non-null  object
dtypes: object(3)
memory usage: 22.9+ MB
>>> df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   column_1  1000000 non-null  object
 1   column_2  1000000 non-null  object
 2   column_3  1000000 non-null  object
dtypes: object(3)
memory usage: 165.9 MB
render_widget(renderer: str = 'altair', input_current_vis='')[source]

Generate a LuxWidget based on the LuxDataFrame

Structure of widgetJSON:

{
    'current_vis': {},
    'recommendation': [
        {
            'action': 'Correlation',
            'description': "some description",
            'vspec': [
                {Vega-Lite spec for vis 1},
                {Vega-Lite spec for vis 2},
                ...
            ]
        },
        ... repeat for other actions
    ]
}

Parameters:
  • renderer (str, optional) – Choice of visualization rendering library, by default “altair”
  • input_current_vis (lux.LuxDataFrame, optional) – User-specified current vis to override the default current vis; by default the empty string ''
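The widgetJSON layout above can be sketched as a plain Python dict (a hypothetical illustration; the 'vspec' entries here are placeholder dicts standing in for real Vega-Lite specs):

```python
# Hypothetical sketch of the widgetJSON layout described above.
# The entries under 'vspec' are placeholders, not real Vega-Lite specs.
widget_json = {
    "current_vis": {},
    "recommendation": [
        {
            "action": "Correlation",
            "description": "some description",
            "vspec": [
                {"mark": "point"},  # placeholder for the Vega-Lite spec of vis 1
                {"mark": "bar"},    # placeholder for the Vega-Lite spec of vis 2
            ],
        },
        # ... one entry per action
    ],
}
```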
save_as_html(filename: str = 'export.html') → None[source]

Save the dataframe widget as a static HTML file.

Parameters:filename (str) – Filename for the output HTML file
set_data_type(types: dict)[source]

Set the data type for a particular attribute in the dataframe, overriding the type automatically inferred by Lux.

Parameters:types (dict) – Dictionary that maps attribute/column names to a specified Lux type. Possible options: "nominal", "quantitative", "id", and "temporal".

Example

df = pd.read_csv("https://raw.githubusercontent.com/lux-org/lux-datasets/master/data/absenteeism.csv")
df.set_data_type({"ID": "id", "Reason for absence": "nominal"})
set_intent_as_vis(vis: lux.vis.Vis.Vis)[source]

Set the intent of the dataframe based on the intent of a Vis.

Parameters:vis (Vis) – Input Vis object
tail(n: int = 5)[source]

Return the last n rows.

This function returns the last n rows of the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows.

For negative values of n, this function returns all rows except the first n rows, equivalent to df[n:].

Parameters:n (int, default 5) – Number of rows to select.
Returns:The last n rows of the caller object.
Return type:type of caller

See also

DataFrame.head()
The first n rows of the caller object.

Examples

>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
8      zebra

Viewing the last 5 lines

>>> df.tail()
   animal
4  monkey
5  parrot
6   shark
7   whale
8   zebra

Viewing the last n lines (three in this case)

>>> df.tail(3)
  animal
6  shark
7  whale
8  zebra

For negative values of n

>>> df.tail(-3)
   animal
3    lion
4  monkey
5  parrot
6   shark
7   whale
8   zebra
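As with head, the positional semantics can be checked against slicing; a minimal sketch (assuming only pandas, using the same example frame):

```python
import pandas as pd

# Same example frame as above
df = pd.DataFrame({"animal": ["alligator", "bee", "falcon", "lion",
                              "monkey", "parrot", "shark", "whale", "zebra"]})

# tail(n) is positional: equivalent to the slice df[-n:]
last_three = df.tail(3)

# For negative n, all rows except the first n: equivalent to df[n:]
all_but_first_three = df.tail(-3)
```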
exported

Get selected visualizations as exported Vis List

Notes

Convert the _selectedVisIdxs dictionary into a programmable VisList. Example _selectedVisIdxs:

{'Correlation': [0, 2], 'Occurrence': [1]}

indicating that the 0th and 2nd vis from the Correlation tab and the 1st vis from the Occurrence tab are selected.

Returns:
  • When there are no exported vis, return an empty list -> []
  • When all the exported vis are from the same tab, return a VisList of the selected visualizations -> VisList(v1, v2, ...)
  • When the exported vis are from different tabs, return a dictionary with the action name as key and the selected visualizations as a VisList -> {"Enhance": VisList(v1, v2, ...), "Filter": VisList(v5, v7, ...), ...}
Return type:Union[Dict[str,VisList], VisList]
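The documented return convention can be sketched in plain Python (a hypothetical illustration only; plain lists stand in for VisList objects, and collect_exported is not part of the Lux API):

```python
def collect_exported(selected):
    """Illustrative sketch of the documented return convention.

    `selected` maps an action/tab name to the visualizations selected
    under that tab (plain lists stand in for VisList objects).
    """
    # Drop tabs with no selections
    non_empty = {tab: viz for tab, viz in selected.items() if viz}
    # No selections at all -> empty list
    if not non_empty:
        return []
    # All selections from a single tab -> that tab's VisList alone
    if len(non_empty) == 1:
        return next(iter(non_empty.values()))
    # Selections across multiple tabs -> dict keyed by action name
    return non_empty

collect_exported({})                                     # -> []
collect_exported({"Enhance": ["v1", "v2"]})              # -> ["v1", "v2"]
collect_exported({"Enhance": ["v1"], "Filter": ["v5"]})  # -> dict keyed by action
```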
intent

Main function to set the intent of the dataframe. The intent input goes through the parser, so that the string inputs are parsed into a lux.Clause object.

Parameters:intent (List[str, Clause]) – Intent list; can be a mix of string shorthand and lux.Clause objects.

Notes

See the intent guide (../guide/intent) for details on specifying the intent.