5. Common Methods on Series or DataFrames


DataFrame Attributes and Arithmetic

Once you have loaded one or more DataFrames, you may want to investigate various aspects of the data. This could be done by looking at the shape (number of rows and columns) of the DataFrame or the mean of a single column. This could also involve computing arithmetic operations across columns (i.e. Series). The following module focuses on these two concepts and will help you better understand how you can analyze the data you have loaded into Pandas.

DataFrame Attributes

A DataFrame provides various attributes to access information (metadata) about the data it stores. Among these attributes, the ‘shape’ attribute, previously introduced, provides the number of rows and columns. However, several other attributes convey information such as data types and the total number of values. When exploring a dataset, the following four attributes are particularly valuable:

Attribute Description
shape Returns a tuple representing the dimensionality of the DataFrame.
size Returns an int representing the number of elements in this object.
dtypes Returns the data types in the DataFrame.
columns Returns an Index holding the column labels (header names) of the DataFrame.
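As a quick illustration of the four attributes in the table, here is a minimal sketch. The DataFrame contents (City, Latitude, Population) are hypothetical sample values, not the dataset used in this module.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "City": ["Austin", "Boston", "Denver"],
    "Latitude": [30.27, 42.36, 39.74],
    "Population": [961855, 675647, 715522],
})

print(df.shape)    # (number of rows, number of columns) as a tuple
print(df.size)     # total number of elements: rows * columns
print(df.dtypes)   # one data type per column, returned as a Series
print(df.columns)  # the column labels, returned as an Index
```

Note that these are attributes, not methods, so they are written without parentheses.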

Inspecting Data Types

DataFrame data types determine which methods are applicable to a column. For instance, calculating the mean of an Object column is not feasible, as Pandas interprets this type as containing strings (i.e., textual data).

We have already used the .columns attribute to return column labels.

However, what if we wanted to see the data type associated with each column header? Luckily, there is a quick and easy way to do this: the dtypes attribute. dtypes is a Series maintained by each DataFrame that holds the data type of every column in that DataFrame. For example, for the DataFrame called df (seen below), df.dtypes returns the data type of each of its columns.

[Figure: Types DataFrame]
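A small sketch of how dtypes can be inspected and used to find the columns that support numeric methods. The column names and values here are hypothetical stand-ins for df.

```python
import pandas as pd

# Hypothetical stand-in for df; values are for illustration only.
df = pd.DataFrame({
    "Name": ["a", "b"],
    "Latitude": [1.5, 2.5],
    "Count": [1, 2],
})

dtypes = df.dtypes           # Series indexed by column label
print(dtypes)
print(dtypes["Latitude"])    # data type of a single column

# Keep only the columns whose dtype is numeric (ints and floats).
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Selecting by dtype first is one way to avoid calling a numeric method such as mean() on a text column.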

Data Types

Remember that Pandas has a number of different data types:

Python Type        Equivalent Pandas Type    Description
string or mixed    object                    Columns made up partially or completely of strings
int                int64                     Columns with numeric (integer) values. The 64 refers to the size (in bits) of the memory space allocated to this type
float              float64                   Columns with floating-point numbers (numbers with decimal points)
bool               bool                      True/False values
datetime           datetime64                Date and/or time values

Pandas can usually infer column types automatically, but some columns have to be converted explicitly. For instance, below we will use the column named ‘date mmddyy’ to derive a new column named ‘date’ with the data type ‘datetime’.

We convert the data in the ‘date mmddyy’ column to a new ‘datetime’ Series using the pd.to_datetime function, specifying the date format as ‘%m%d%y’. In date formatting language this means month, day, and then year, each denoted by two digits with no separators. This format parameter follows Python's native string-to-datetime format codes. More information can be found in the Python docs on string-to-datetime conversion.
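The conversion described above might look like the following sketch. The date strings are hypothetical, and the sketch assumes the column holds strings, so that leading zeros such as the "01" in "010521" are preserved.

```python
import pandas as pd

# Hypothetical mmddyy strings: two digits each for month, day, and year.
raw = pd.Series(["010521", "123120", "070419"], name="date mmddyy")

# %m = two-digit month, %d = two-digit day, %y = two-digit year.
converted = pd.to_datetime(raw, format="%m%d%y")

print(converted)
print(converted.dtype)  # a datetime64 dtype
```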

Now that we have the correct output format, we can create a new column to hold the converted data by creating a new named column. We will also drop the previously used ‘date mmddyy’ column to prevent confusion.
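A minimal sketch of these two steps, assigning the converted Series to a new column and then dropping the original one. The Station column and the date values are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with a text date column in mmddyy form.
df = pd.DataFrame({
    "Station": ["A", "B"],
    "date mmddyy": ["010521", "123120"],
})

# Assigning to a new label creates the column at the right-hand side.
df["date"] = pd.to_datetime(df["date mmddyy"], format="%m%d%y")

# Drop the original text column to prevent confusion.
df = df.drop(columns=["date mmddyy"])

print(df.columns.tolist())
```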

For reference, this is what the final DataFrame looks like. Note that the date column is at the right side of the DataFrame since it was added last.

[Figure: Converted DataFrame]

DataFrame Methods

When dealing with a DataFrame, there are various built-in methods to help summarize the data. Methods are accessible using the df.method() syntax, where df is a DataFrame. A list of some of these methods is provided below:

Method Description
head() Return the first n rows.
tail() Return the last n rows.
min(), max() Compute the minimum and maximum of a Series, or of each column of a DataFrame by default (numeric order for numeric values, alphanumeric order for object values).
sum(), mean(), std(), var() Compute the sum, mean, standard deviation, and variance of a Series, or of each column of a DataFrame by default.
nlargest() Returns the first n rows of the Series or DataFrame, ordered by the specified columns in descending order.
count() Returns the number of non-NaN values in a Series or DataFrame.
value_counts() Returns the frequency for each value in the Series.
describe() Computes summary statistics for each column.
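To make the table concrete, here is a short sketch that calls several of these methods. The City and Temp columns are hypothetical, and one Temp value is deliberately missing to show how NaN is handled.

```python
import pandas as pd

# Hypothetical data; None becomes NaN in the numeric Temp column.
df = pd.DataFrame({
    "City": ["Austin", "Boston", "Denver", "Boston"],
    "Temp": [35.0, 22.0, None, 25.0],
})

print(df.head(2))                  # first 2 rows
print(df.tail(2))                  # last 2 rows
print(df["Temp"].max())            # NaN is skipped by default
print(df["Temp"].count())          # counts only non-NaN values
print(df["City"].value_counts())   # frequency of each distinct value
print(df.nlargest(2, "Temp"))      # the 2 rows with the largest Temp
```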

In this section, we will explore the mean() method, which serves as a typical example of method behavior. Subsequently, we will employ the describe() method, which presents more complex functionality.

mean() Method

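The mean() method calculates the mean along a given axis: axis=0 (the default) averages down the rows and returns one mean per column, while axis=1 averages across the columns and returns one mean per row. As an example, let’s reuse our previous DataFrame df.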

If we want to find the mean of all our numeric columns, we could use the following command:
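A sketch of such a command, using a hypothetical stand-in for df. The numeric_only=True argument restricts the calculation to numeric columns; in pandas 2.0 and later, omitting it on a DataFrame that contains text columns raises a TypeError.

```python
import pandas as pd

# Hypothetical stand-in for df; values chosen so the means are easy to check.
df = pd.DataFrame({
    "City": ["Austin", "Boston", "Denver"],
    "Latitude": [30.0, 42.0, 39.0],
    "Longitude": [-97.0, -71.0, -105.0],
})

# Default axis=0: one mean per numeric column.
col_means = df.mean(numeric_only=True)
print(col_means)

# axis=1: one mean per row, again using only the numeric columns.
row_means = df.mean(axis=1, numeric_only=True)
print(row_means)
```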

Single Column (Series) Methods

If we only want the mean of a single column, we call mean() on that column (i.e. a Series) instead. For the Latitude column in the example above, df['Latitude'].mean() returns a single value, 31.09682, which is the mean of that column (as seen above).

Other methods like max(), var(), and count() work the same way as mean().
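The same pattern on a single column can be sketched as follows. The Latitude values are hypothetical, so the results differ from the 31.09682 quoted above.

```python
import pandas as pd

# Hypothetical Latitude values for illustration.
df = pd.DataFrame({"Latitude": [30.0, 42.0, 39.0]})

lat = df["Latitude"]   # selecting one column returns a Series

print(lat.mean())    # a single number rather than a Series
print(lat.max())
print(lat.var())     # sample variance (pandas divides by n - 1 by default)
print(lat.count())   # number of non-NaN values
```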

describe() Method

The describe() method offers a variety of statistical summaries for the DataFrame based on its contents, with some customization options demonstrated below.

Here we get various statistics, such as the mean of each column, how many non-NaN values each column contains, the standard deviation of each column, etc. The percent values correspond to percentiles of each column, e.g. the 25th percentile. The presence of NaN values indicates that certain statistics, e.g. the mean, cannot be computed for an object type column.
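A sketch of describe() with two of its customization options, include and percentiles. The data is hypothetical; include="all" is what produces the NaN entries mentioned above, because text columns then appear alongside numeric ones.

```python
import pandas as pd

# Hypothetical data with one text column and one numeric column.
df = pd.DataFrame({
    "City": ["Austin", "Boston", "Denver", "Boston"],
    "Latitude": [30.0, 42.0, 39.0, 42.0],
})

# By default, only numeric columns are summarized.
numeric_summary = df.describe()
print(numeric_summary)

# include="all" also summarizes text columns (count, unique, top, freq);
# statistics that do not apply to a column are reported as NaN.
full_summary = df.describe(include="all")
print(full_summary)

# percentiles= replaces the default 25%/50%/75% rows (the median is always kept).
custom = df.describe(percentiles=[0.1, 0.9])
print(custom)
```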

Accessor Methods

For Series containing specialized data types (like strings, datetime values, or categorical data), Pandas provides accessors to offer specialized methods for those data types. These are accessed as series.accessor.method() where series is a Series object. Here are some of the examples of accessors and their methods:

Method Description
str.upper() Converts strings in the Series to uppercase.
str.lower() Converts strings in the Series to lowercase.
str.len() Computes the length of each string.
dt.year, dt.month, dt.day, dt.hour Return the year, month, day, and hour of each datetime.
cat.categories Returns the categories of the Series.
cat.ordered Indicates whether the categories have an order.

In practice, the accessor name identifies the data type: .str for strings, .dt for datetime values, and .cat for categorical data. For example, series.str.upper() applies upper() to every string in the Series.
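The three accessors can be sketched on small sample Series as follows. All values are hypothetical and chosen only to show the syntax.

```python
import pandas as pd

# String accessor (.str)
names = pd.Series(["Austin", "Boston"])
print(names.str.upper())
print(names.str.len())

# Datetime accessor (.dt); year and month are attributes, so no parentheses.
dates = pd.Series(pd.to_datetime(["2021-01-05", "2020-12-31"]))
print(dates.dt.year)
print(dates.dt.month)

# Categorical accessor (.cat)
levels = pd.Series(
    pd.Categorical(["low", "high", "low"], categories=["low", "high"], ordered=True)
)
print(levels.cat.categories)
print(levels.cat.ordered)
```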

DataFrame Arithmetic

Pandas makes it straightforward to perform arithmetic between columns or rows of different DataFrames. For instance, if we have two DataFrames as depicted below and wish to add their ‘AA’ columns together, we can do so with the following code snippet:

[Figure: Alignment Arithmetic Columns]

To perform this operation, Pandas first aligns the two columns (Series) on their index labels. For every label that has a value in both Series, the sum is calculated. For labels where either Series has a NaN value, or where the label is absent from one Series altogether, the output value is NaN. A diagram of this process is shown below:

[Figure: Alignment Arithmetic Method]

In our notebook we would get a Series as our output:
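Since the original figures are not reproduced here, the following sketch uses hypothetical contents for df_1 and df_2. Their indexes overlap only partially so that the alignment step is visible in the result.

```python
import pandas as pd

# Hypothetical DataFrames; only labels "b" and "c" appear in both.
df_1 = pd.DataFrame({"AA": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
df_2 = pd.DataFrame({"AA": [10.0, 20.0, 30.0]}, index=["b", "c", "d"])

# The Series are aligned on index labels before being added.
total = df_1["AA"] + df_2["AA"]
print(total)
# "b" and "c" hold sums; "a" and "d" exist in only one Series, so they are NaN.
```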

DataFrame Row Arithmetic

Calculating the sum across different rows is quite similar to column-wise calculations. The key difference is that you must use a method for selecting rows, such as .loc. An illustrative figure is shown below.

[Figure: Alignment Arithmetic Row]
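A sketch of row arithmetic with .loc, again using hypothetical DataFrames. A row selected with .loc is a Series indexed by column label, so alignment happens on the column names.

```python
import pandas as pd

# Hypothetical DataFrames that share the row labels but only one column ("AA").
df_1 = pd.DataFrame({"AA": [1.0, 2.0], "BB": [3.0, 4.0]}, index=["r1", "r2"])
df_2 = pd.DataFrame({"AA": [10.0, 20.0], "CC": [5.0, 6.0]}, index=["r1", "r2"])

# Select the same row from each DataFrame and add them.
row_sum = df_1.loc["r1"] + df_2.loc["r1"]
print(row_sum)
# Only "AA" exists in both rows; "BB" and "CC" come out as NaN.
```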

Broadcasting

Pandas also allows you to do arithmetic operations between a DataFrame or Series and a scalar (i.e. a single number). For example, you can add a scalar to the ‘AA’ column of the previously described DataFrame df_1 with df_1['AA'] + 0.3.

Here you simply add 0.3 to each entry in the Series. The same happens for a whole DataFrame: 0.3 is added to every entry. Note: this only works for DataFrames that are entirely numeric; if there are any object columns, you will get an error message.
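A sketch of scalar broadcasting on a Series and on a whole DataFrame. The contents of df_1 are hypothetical, and the last step shows one possible way (an assumption, not something this module prescribes) to sidestep the error for mixed-type DataFrames by selecting the numeric columns first.

```python
import pandas as pd

# Hypothetical, entirely numeric DataFrame.
df_1 = pd.DataFrame({"AA": [1.0, 2.0, 3.0], "BB": [4.0, 5.0, 6.0]})

shifted_col = df_1["AA"] + 0.3   # 0.3 is added to every entry of the Series
shifted_df = df_1 + 0.3          # 0.3 is added to every entry of the DataFrame
print(shifted_col)
print(shifted_df)

# A DataFrame with a text column: `mixed + 0.3` would fail here,
# so restrict the operation to the numeric columns.
mixed = pd.DataFrame({"AA": [1.0, 2.0], "Name": ["x", "y"]})
numeric_only = mixed.select_dtypes(include="number") + 0.3
print(numeric_only)
```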