lux.core package

Submodules

lux.core.frame module

class lux.core.frame.LuxDataFrame(*args, **kw)[source]

Bases: pandas.core.frame.DataFrame

A subclass of pd.DataFrame that supports all dataframe operations while housing other variables and functions for generating visual recommendations.
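A minimal usage sketch (the CSV path is a hypothetical placeholder): once lux is imported alongside pandas, newly created DataFrames are LuxDataFrame instances and support the usual pandas workflow while also surfacing visual recommendations when displayed in a notebook.

>>> import lux
>>> import pandas as pd
>>> df = pd.read_csv("data.csv")  # hypothetical file; df is a LuxDataFrame
>>> df.head()                     # standard pandas operations still work
>>> df                            # in a notebook, displays Lux's visual recommendations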

clear_intent()[source]
clear_plot_config()[source]
compute_SQL_data_type()[source]
compute_SQL_dataset_metadata()[source]
compute_SQL_stats()[source]
copy_intent()[source]
static current_vis_to_JSON(vlist, input_current_vis='')[source]
describe(*args, **kwargs)[source]

Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters
  • percentiles (list-like of numbers, optional) – The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

  • include ('all', list-like of dtypes or None (default), optional) –

    A white list of data types to include in the result. Ignored for Series. Here are the options:

    • ’all’ : All columns of the input will be included in the output.

    • A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'

    • None (default) : The result will include all numeric columns.

  • exclude (list-like of dtypes or None (default), optional) –

    A black list of data types to omit from the result. Ignored for Series. Here are the options:

    • A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'

    • None (default) : The result will exclude nothing.

  • datetime_is_numeric (bool, default False) –

    Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DataFrame input, this also controls whether datetime columns are included by default.

    New in version 1.1.0.

Returns

Summary statistics of the Series or Dataframe provided.

Return type

Series or DataFrame

See also

DataFrame.count()

Count number of non-NA/null observations.

DataFrame.max()

Maximum of the values in the object.

DataFrame.min()

Minimum of the values in the object.

DataFrame.mean()

Mean of the values.

DataFrame.std()

Standard deviation of the observations.

DataFrame.select_dtypes()

Subset of a DataFrame including/excluding columns based on their dtype.

Notes

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

Examples

Describing a numeric Series.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')  
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])  
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              f
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])  
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])  
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
display_pandas()[source]
expire_metadata()[source]
expire_recs()[source]
get_SQL_attributes()[source]
get_SQL_cardinality()[source]
get_SQL_unique_values()[source]
head(n: int = 5)[source]

Return the first n rows.

This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].

Parameters

n (int, default 5) – Number of rows to select.

Returns

The first n rows of the caller object.

Return type

same type as caller

See also

DataFrame.tail()

Returns the last n rows.

Examples

>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
8      zebra

Viewing the first 5 lines

>>> df.head()
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey

Viewing the first n lines (three in this case)

>>> df.head(3)
      animal
0  alligator
1        bee
2     falcon

For negative values of n

>>> df.head(-3)
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
info(*args, **kwargs)[source]

Print a concise summary of a DataFrame.

This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

Parameters
  • data (DataFrame) – DataFrame to print information about.

  • verbose (bool, optional) – Whether to print the full summary. By default, the setting in pandas.options.display.max_info_columns is followed.

  • buf (writable buffer, defaults to sys.stdout) – Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.

  • max_cols (int, optional) – When to switch from the verbose to the truncated output. If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columns is used.

  • memory_usage (bool, str, optional) –

    Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed. By default, this follows the pandas.options.display.memory_usage setting.

    True always shows memory usage. False never shows memory usage. A value of 'deep' is equivalent to "True with deep introspection". Memory usage is shown in human-readable units (base-2 representation). Without deep introspection, a memory estimation is made based on column dtype and number of rows, assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources.

  • null_counts (bool, optional) – Whether to show the non-null counts. By default, this is shown only if the DataFrame is smaller than pandas.options.display.max_info_rows and pandas.options.display.max_info_columns. A value of True always shows the counts, and False never shows the counts.

Returns

This method prints a summary of a DataFrame and returns None.

Return type

None

See also

DataFrame.describe()

Generate descriptive statistics of DataFrame columns.

DataFrame.memory_usage()

Memory usage of DataFrame columns.

Examples

>>> int_values = [1, 2, 3, 4, 5]
>>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
>>> float_values = [0.0, 0.25, 0.5, 0.75, 1.0]
>>> df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
...                   "float_col": float_values})
>>> df
    int_col text_col  float_col
0        1    alpha       0.00
1        2     beta       0.25
2        3    gamma       0.50
3        4    delta       0.75
4        5  epsilon       1.00

Prints information of all columns:

>>> df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   int_col    5 non-null      int64
 1   text_col   5 non-null      object
 2   float_col  5 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes

Prints a summary of columns count and its dtypes but not per column information:

>>> df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Columns: 3 entries, int_col to float_col
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes

Pipe the output of DataFrame.info to a buffer instead of sys.stdout, get the buffer content, and write it to a text file:

>>> import io
>>> buffer = io.StringIO()
>>> df.info(buf=buffer)
>>> s = buffer.getvalue()
>>> with open("df_info.txt", "w",
...           encoding="utf-8") as f:  
...     f.write(s)
260

The memory_usage parameter allows deep introspection mode, which is especially useful for big DataFrames and fine-tuning memory optimization:

>>> random_strings_array = np.random.choice(['a', 'b', 'c'], 10 ** 6)
>>> df = pd.DataFrame({
...     'column_1': np.random.choice(['a', 'b', 'c'], 10 ** 6),
...     'column_2': np.random.choice(['a', 'b', 'c'], 10 ** 6),
...     'column_3': np.random.choice(['a', 'b', 'c'], 10 ** 6)
... })
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   column_1  1000000 non-null  object
 1   column_2  1000000 non-null  object
 2   column_3  1000000 non-null  object
dtypes: object(3)
memory usage: 22.9+ MB
>>> df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   column_1  1000000 non-null  object
 1   column_2  1000000 non-null  object
 2   column_3  1000000 non-null  object
dtypes: object(3)
memory usage: 188.8 MB
static intent_to_JSON(intent)[source]
static intent_to_string(intent)[source]
maintain_metadata()[source]
maintain_recs()[source]
static rec_to_JSON(recs)[source]
removeDeletedRecs(change)[source]
render_widget(renderer: str = 'altair', input_current_vis='')[source]

Generate a LuxWidget based on the LuxDataFrame.

Structure of widgetJSON:

{
    'current_vis': {},
    'recommendation': [
        {
            'action': 'Correlation',
            'description': "some description",
            'vspec': [
                {Vega-Lite spec for vis 1},
                {Vega-Lite spec for vis 2},
                …
            ]
        },
        … repeat for other actions
    ]
}

Parameters
  • renderer (str, optional) – Choice of visualization rendering library, by default "altair".

  • input_current_vis (lux.LuxDataFrame, optional) – User-specified current vis to override the default current vis, by default ''.
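A hedged usage sketch (assumed to run in a Jupyter notebook, where the widget is actually displayed):

>>> widget = df.render_widget(renderer="altair")
>>> widget  # display the generated LuxWidget in a notebook cell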

set_SQL_connection(connection, t_name)[source]
set_executor_type(exe)[source]
set_intent(intent: List[Union[str, lux.vis.Clause.Clause]])[source]

Main function to set the intent of the dataframe. The intent input goes through the parser, so that string inputs are parsed into lux.Clause objects.

Parameters

intent (List[str,Clause]) – intent list, can be a mix of string shorthand or a lux.Clause object

Notes

See the Clause guide (../guide/clause) for details on how intent can be specified.
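A minimal sketch, assuming hypothetical columns "Weight" and "Origin"; string shorthand and lux.Clause objects can be mixed in the same intent list:

>>> from lux.vis.Clause import Clause
>>> df.set_intent(["Weight", "Origin=USA"])                    # string shorthand only
>>> df.set_intent([Clause(attribute="Weight"), "Origin=USA"])  # mixed with a Clause object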

set_intent_as_vis(vis: lux.vis.Vis.Vis)[source]

Set the intent of the dataframe to match the given Vis.

Parameters

vis (Vis) –
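A minimal sketch, assuming hypothetical attributes "Weight" and "Horsepower" and that a Vis is constructed from an intent list and a source dataframe:

>>> from lux.vis.Vis import Vis
>>> vis = Vis(["Weight", "Horsepower"], df)  # build a Vis from an intent and a source
>>> df.set_intent_as_vis(vis)                # reuse the Vis's intent for the dataframe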

tail(n: int = 5)[source]

Return the last n rows.

This function returns the last n rows from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows.

For negative values of n, this function returns all rows except the first n rows, equivalent to df[n:].

Parameters

n (int, default 5) – Number of rows to select.

Returns

The last n rows of the caller object.

Return type

type of caller

See also

DataFrame.head()

The first n rows of the caller object.

Examples

>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
8      zebra

Viewing the last 5 lines

>>> df.tail()
   animal
4  monkey
5  parrot
6   shark
7   whale
8   zebra

Viewing the last n lines (three in this case)

>>> df.tail(3)
  animal
6  shark
7  whale
8  zebra

For negative values of n

>>> df.tail(-3)
   animal
3    lion
4  monkey
5  parrot
6   shark
7   whale
8   zebra
to_JSON(rec_infolist, input_current_vis='')[source]
to_pandas()[source]
property current_vis
property default_display
property exported

Get the selected visualizations as an exported VisList.

Notes

Converts the _exportedVisIdxs dictionary into a programmable VisList.

Example _exportedVisIdxs:

{'Correlation': [0, 2], 'Occurrence': [1]}

indicating that the 0th and 2nd vis from the Correlation tab and the 1st vis from the Occurrence tab are selected.

Returns

  • When there are no exported vis, return an empty list -> []

  • When all the exported vis are from the same tab, return a VisList of the selected visualizations -> VisList(v1, v2, …)

  • When the exported vis are from different tabs, return a dictionary with the action name as key and the selected visualizations as a VisList -> {"Enhance": VisList(v1, v2, …), "Filter": VisList(v5, v7, …), …}

Return type

Union[Dict[str,VisList], VisList]
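A hedged usage sketch: selection happens interactively in the widget (via its export view), after which the chosen visualizations can be read back programmatically.

>>> df                      # display the widget and select/export visualizations interactively
>>> selected = df.exported  # VisList, or {action name: VisList} if selections span multiple tabs
>>> selected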

property history
property intent
property plot_config
property recommendation
property widget

Module contents

lux.core.setOption(overridePandas=True)[source]