Pandas functions. Lots of them.

Pandas is your best friend for your data needs. It is the king of data manipulation in the Python empire.

Any data scientist intending to use Python as their tool of choice must master Pandas. It is compulsory, like learning to walk before you run.

So here is a quick reference list of functions, just for you.

But reading is no substitute for repeated practice, so don't expect to remember the functions below after one pass. The list is a cheatsheet, not an oracle.

Creating a dataframe

The bread and butter of Pandas. Let’s start with some numpy foreplay.

import numpy as np
import pandas as pd
dates = pd.date_range('2017-06-21', '2017-06-27')
pd.DataFrame(np.random.randint(0, 10, 7), index=dates, columns=['freq'])

You can also create a dataframe in other ways. Here’s a multi-column version from a dictionary.

x = {'a': np.random.randint(0, 10, 7),
     'b': np.random.randint(0, 10, 7)}
pd.DataFrame(x)

Creating a series

Series are the loyal servants in the Pandas empire.

To create one, use pd.Series(x, index).

Here, x is a lowly array, dict, scalar, or something else. It will be paired with index for eternity, or until death taketh them. Or a memory leak.
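A minimal sketch of the input types mentioned above. The variable names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a list, with explicit labels as the index.
s_list = pd.Series([10, 20, 30], index=["a", "b", "c"])

# From a dict: the keys become the index.
s_dict = pd.Series({"x": 1.5, "y": 2.5})

# From a scalar: the value is repeated for every index label.
s_scalar = pd.Series(7, index=["p", "q", "r"])

# From a numpy array: with no index given, pandas numbers the rows 0, 1, 2, ...
s_array = pd.Series(np.arange(4))

print(s_list["b"])         # look up a value by its label
print(s_scalar.tolist())   # the scalar, once per label
```

Passing index as a keyword (index=...) is optional here, but it tends to read more clearly than relying on argument order.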

Dataframe functions

Thirty of the finest functions, arranged for your convenience. Master this list, and mastery of self follows.

  1. pd.DataFrame.head() | returns the first five rows of a dataframe.
  2. pd.DataFrame.tail() | returns the last five rows of a dataframe.
  3. pd.DataFrame.index | the index (row labels) of a dataframe.
  4. pd.DataFrame.columns | the column labels of a dataframe.
  5. pd.DataFrame.dtypes | the data type of each column of a dataframe.
  6. pd.DataFrame.values | the values of a dataframe as a numpy array.
  7. pd.DataFrame.describe() | summarise a dataframe: return summary statistics including the number of observations per column, the mean of each column and the standard deviation of each column.
  8. pd.DataFrame.info() | brief summary of a dataframe.
  9. pd.DataFrame.T | transpose a dataframe.
  10. pd.DataFrame.sort_index() | sort a dataframe by its index values. Can specify the axis (colnames, rownames) and the order of sorting.
  11. pd.DataFrame.sort_values('col') | sort a dataframe by the column name col.
  12. pd.DataFrame.iloc[i] | slice and subset your data by integer position.
  13. pd.DataFrame.loc[] | slice and subset your data by label (index and column names) or by a boolean mask.
  14. pd.DataFrame.isin(l) | return a dataframe of True/False values showing whether each element is in the list l.
  15. pd.DataFrame.set_index(s) | set the index of a dataframe to the column named s; s can also be a list of column names, which creates a MultiIndex.
  16. pd.DataFrame.swaplevel(i, j) | swap the levels i and j in a MultiIndex.
  17. pd.DataFrame.drop('c1', axis=1, inplace=True) | drop a column c1 from a dataframe, modifying the dataframe in place.
  18. pd.DataFrame.iterrows() | a generator for iterating over the rows of a dataframe.
  19. pd.DataFrame.apply(f, axis) | apply a function f to each column (axis=0) or each row (axis=1) of a dataframe.
  20. pd.DataFrame.applymap(f) | apply a function f elementwise to a dataframe. From pandas 2.1 the same operation is spelled pd.DataFrame.map(f), and applymap is deprecated.
  21. pd.DataFrame.drop(s, axis=1) | return a copy of the dataframe without column s; the original is left untouched.
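  22. pd.DataFrame.resample('offsetString') | convenient way to group timeseries into bins. The offset string (for example 'D' for daily or 'W' for weekly) is described under offset aliases in the pandas time series documentation.
  23. pd.DataFrame.merge(df) | join a dataframe df to another dataframe. Can specify the type of join with the how parameter.
  24. pd.DataFrame.append(df) | append the rows of the dataframe df to a dataframe, like rbind() in R. This method was deprecated in pandas 1.4 and removed in 2.0; use pd.concat([df1, df2]) instead.
  25. pd.DataFrame.reset_index() | reset the index back to the default numeric row counter.
  26. pd.DataFrame.idxmax() | return the index label of the maximum value in each column; the label-based counterpart of the numpy argmax method.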
  27. pd.DataFrame.isnull() | indicates if values are null or not.
  28. pd.DataFrame.from_dict(d) | create a dataframe from a dictionary d.
  29. pd.DataFrame.stack() | turn column names into index labels.
  30. pd.DataFrame.unstack() | turn index values into column names.
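A handful of the functions above in action, as a rough sketch. The dataframe, its column names and its values are invented for illustration. pd.concat stands in for append, since append is gone from pandas 2.0 onwards.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune", "Kyiv"],
     "temp": [4, 19, 31, 9],
     "rain": [80, 2, 10, 40]},
    index=["r1", "r2", "r3", "r4"],
)

first_two = df.head(2)                     # first two rows
by_temp = df.sort_values("temp")           # ascending by the temp column
second_row = df.iloc[1]                    # row at integer position 1
r3_temp = df.loc["r3", "temp"]             # row label r3, column temp
mask = df["city"].isin(["Oslo", "Kyiv"])   # boolean Series, one value per row
cold = df[mask]                            # keep only the matching rows

no_rain = df.drop("rain", axis=1)          # returns a copy; df is unchanged
indexed = df.set_index("city")             # city becomes the index
hottest = indexed["temp"].idxmax()         # index label of the largest temp

# apply with axis=0 calls the function once per column.
ranges = df[["temp", "rain"]].apply(lambda col: col.max() - col.min(), axis=0)

# pd.concat stacks dataframes row-wise, the job DataFrame.append used to do.
extra = pd.DataFrame({"city": ["Cairo"], "temp": [28], "rain": [1]}, index=["r5"])
combined = pd.concat([df, extra])

print(by_temp)
print(hottest)
```

Note that drop, sort_values and set_index all return new dataframes here, so df itself stays as it was created.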

Groupby methods

We turn next to the Groupby methods. A useful family, these ones.

To group a dataframe by a column (or columns), use pd.DataFrame.groupby('colname'). This returns a DataFrameGroupBy object, on which you can call a certain set of methods.

So! Say gb is a DataFrameGroupBy object, obtained faithfully from pd.DataFrame.groupby().

There are some very useful aggregation methods you can call on it: sum, min, max, mean, median and std. Hardworking citizens of the data science empire, those guys.

More useful methods:

  • gb.agg(arr) | apply each aggregation function listed in arr to every group and return the results side by side.
  • gb.size() | return the number of elements in each group.
  • gb.describe() | returns summary statistics.
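Here is a small sketch of the groupby workflow described above. The team and score columns are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["red", "red", "blue", "blue", "blue"],
    "score": [3, 5, 2, 8, 5],
})

gb = df.groupby("team")        # a DataFrameGroupBy object

# Selecting a column first gives one aggregated value per group.
totals = gb["score"].sum()

# Number of rows in each group.
sizes = gb.size()

# Several aggregations at once; the result has (column, function) column labels.
summary = gb.agg(["min", "max"])

print(totals)
print(sizes)
print(summary)
```

By default the group keys end up as the index of the result, sorted, so blue comes before red.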

String methods

The Pandas library has a set of methods for string manipulation and string handling. They operate on Series objects and are reached through the pd.Series.str accessor. Don’t confuse it with Python’s native str. A false friend, that one.

Again! Let s be a pd.Series of strings. Then you could do

  • s.str[0] – return the first letter of each element in s.
  • s.str.lower() – change each element of s to lowercase.
  • s.str.upper() – change each element of s to uppercase.
  • s.str.len() – return the number of letters of each element of s.
  • s.str.strip() – remove whitespace around the elements of s.
  • s.str.replace('s1', 's2') – replace a substring s1 with a substring s2 for each element of s.
  • s.str.split('s1') – split up the elements of s using s1 as a separator.
  • s.str.get(i) – extract the ith item of each element of s, for example from the lists produced by s.str.split().
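The string methods chain nicely, as this sketch tries to show. The names in s are sample data, and regex=False is passed to replace so that the pattern is treated as a plain substring regardless of the pandas version's default.

```python
import pandas as pd

s = pd.Series(["  Ada Lovelace ", "alan turing", "GRACE HOPPER"])

clean = s.str.strip()                 # remove surrounding whitespace
lower = clean.str.lower()             # lowercase every element
initials = clean.str[0]               # first character of each element
lengths = clean.str.len()             # number of characters per element
parts = lower.str.split(" ")          # each element becomes a list of words
surnames = parts.str.get(1)           # second item of each list
swapped = lower.str.replace("a", "4", regex=False)  # plain substring replace

print(surnames.tolist())
print(lengths.tolist())
```

Each call returns a new Series, so s itself still holds the untidy originals.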

Miscellaneous functions

You want more! Okay then.

  • pd.__version__ | return the version of Pandas.
  • pd.date_range() | create a series of dates in a DatetimeIndex. Some options include:
      • a start date and an end date (e.g. pd.date_range('2015-01-05', '2015-01-10'))
      • a start date, end date and a frequency (e.g. pd.date_range('2016-01', '2016-10', freq='M'); pandas 2.2 and later spell the month-end frequency 'ME')
      • a start date and the number of periods (e.g. pd.date_range('2016-01', periods=10))
  • pd.read_csv(filepath, sep, index_col) | read in a CSV file from a local path or a web address. Specify the separator with the sep parameter, and the column to use as the rownames of the table with the index_col parameter.
  • pd.value_counts() | count how many times a value appears in a column. Recent pandas versions deprecate the top-level function in favour of the method pd.Series.value_counts().
  • pd.crosstab() | create frequency table of two or more factors.
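  • pd.Series.map(f) | apply a function f to each element of a Series; the Series counterpart of applymap.
  • pd.to_datetime() | convert something (strings, numbers, a whole column) to pandas datetime values, stored as numpy datetime64.
  • pd.to_numeric() | convert something to a numeric type, integer or float depending on the data.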
  • pd.concat(objs) | put together data frames in the array objs along a given axis, similar to rbind() or cbind() in R.
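A quick tour of several of these, offered as a sketch. The dates, strings and the colour/size table are invented, and errors="coerce" is one possible way of handling unparseable values, not the only one.

```python
import pandas as pd

# Start date plus a number of periods; the default frequency is daily.
days = pd.date_range("2016-01-01", periods=3)

# Strings to datetimes, and strings to numbers.
stamps = pd.to_datetime(["2017-06-21", "2017-06-27"])
nums = pd.to_numeric(pd.Series(["1", "2.5", "oops"]), errors="coerce")  # bad values become NaN

df = pd.DataFrame({
    "colour": ["red", "blue", "red", "red"],
    "size": ["S", "S", "L", "S"],
})
counts = df["colour"].value_counts()            # occurrences of each value
table = pd.crosstab(df["colour"], df["size"])   # frequency table of two factors

# Elementwise function on a Series.
doubled = pd.Series([1, 2, 3]).map(lambda v: v * 2)

left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"a": [3]})
stacked = pd.concat([left, right], ignore_index=True)               # like rbind()
side = pd.concat([left, left.rename(columns={"a": "b"})], axis=1)   # like cbind()

print(table)
print(stacked)
```

ignore_index=True renumbers the rows of the stacked result from zero; leave it out to keep the original row labels.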

Final words

This is by no means complete; nor does it pretend to be complete.

It’s just a list of functions. No more, no less.