back · main · about · writing · notes · d3 · contact

Pandas functions. Lots of them.
14 Jul 2017 · 981 words

Pandas is your best friend for your data needs. It is the king of data manipulation in the Python empire.

Any data scientist intending to use Python as their tool of choice must master Pandas. It is compulsory, like learning to walk before you run.

So here is a quick reference list of functions, just for you.

But reading is no substitute for repeated practice, so do not expect to remember the functions below after one pass. The list is a cheatsheet, not an oracle.

Creating a dataframe

The bread and butter of Pandas. Let’s start with some numpy foreplay.

import numpy as np
import pandas as pd
dates = pd.date_range('2017-06-21', '2017-06-27')  # seven consecutive days
df = pd.DataFrame(np.random.randint(0, 10, 7), index=dates, columns=['freq'])

You can also create a dataframe in other ways. Here’s a multi-column version from a dictionary.

x = {'a': np.random.randint(0, 10, 7),
     'b': np.random.randint(0, 10, 7)}
df = pd.DataFrame(x)  # the dictionary keys become the column names

Creating a series

Series are the loyal servants in the Pandas empire.

To create one, use pd.Series(x, index).

Here, x is a lowly array, dict, scalar, or something else. It will be paired with index for eternity, or until death taketh them. Or a memory leak.
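As a minimal sketch of those three cases (the values and labels below are made up for illustration):

```python
import numpy as np
import pandas as pd

# From an array, paired with an explicit index of labels.
from_array = pd.Series(np.array([3, 1, 4]), index=['a', 'b', 'c'])

# From a dict: the keys become the index.
from_dict = pd.Series({'x': 10, 'y': 20})

# From a scalar: the value is repeated for every index label.
from_scalar = pd.Series(7, index=['p', 'q', 'r'])

print(from_array['b'])        # look up a value by its label
print(list(from_dict.index))
print(from_scalar.tolist())
```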

Dataframe functions

Thirty of the finest functions, arranged for your convenience. Master this list, and mastery of self follows.

  1. pd.DataFrame.head() | returns the first five rows of a dataframe.
  2. pd.DataFrame.tail() | returns the last five rows of a dataframe.
  3. pd.DataFrame.index | the index (row labels) of a dataframe.
  4. pd.DataFrame.columns | the column labels of a dataframe.
  5. pd.DataFrame.dtypes | the data type of each column of a dataframe.
  6. pd.DataFrame.values | the values of a dataframe, as a numpy array.
  7. pd.DataFrame.describe() | summarise a dataframe: return summary statistics including the number of observations per column, the mean of each column and the standard deviation of each column.
  8.| brief summary of a dataframe.
  9. pd.DataFrame.T | transpose a dataframe.
  10. pd.DataFrame.sort_index() | sort a dataframe by its index values. Can specify the axis (row labels or column names) and the order of sorting.
  11. pd.DataFrame.sort_values('col') | sort a dataframe by the column name col.
  12. pd.DataFrame.iloc[i] | slice and subset your data by integer position.
  13. pd.DataFrame.loc[] | slice and subset your data by label, such as an index value or a column name.
  14. pd.DataFrame.isin(l) | return a dataframe of True or False values, one per element, depending on whether the element is in the list l.
  15. pd.DataFrame.set_index(s) | set the index of a dataframe to the column named s, where s can be a list of column names to create a MultiIndex.
  16. pd.DataFrame.swaplevel(i, j) | swap the levels i and j in a MultiIndex.
  17. pd.DataFrame.drop('c1', axis=1, inplace=True) | drop the column c1 from a dataframe in place.
  18. pd.DataFrame.iterrows() | a generator for iterating over the rows of a dataframe as (index, row) pairs.
  19. pd.DataFrame.apply(f, axis) | apply a function f to each column or each row of a dataframe, depending on the axis.
  20. pd.DataFrame.applymap(f) | apply a function f elementwise to a dataframe. Deprecated since pandas 2.1 in favour of pd.DataFrame.map(f).
  21. pd.DataFrame.drop(s, axis=1) | return a copy of a dataframe with column s removed.
  22. pd.DataFrame.resample('offsetString') | convenient way to group timeseries into bins. Needs a datetime-like index. See the pandas documentation for details on the offset string.
  23. pd.DataFrame.merge(df) | join a dataframe df to another dataframe. Can specify the type of join.
  24. pd.DataFrame.append(df) | append the dataframe df to a dataframe, like rbind() in R. Removed in pandas 2.0; use pd.concat([df1, df2]) instead.
  25. pd.DataFrame.reset_index() | reset the index back to the default numeric row counter.
  26. pd.DataFrame.idxmax() | return the index label of the maximum in each column. Similar to the numpy argmax method, which returns a position instead.
  27. pd.DataFrame.isnull() | indicates if values are null or not.
  28. pd.DataFrame.from_dict(d) | create a dataframe from a dictionary d.
  29. pd.DataFrame.stack() | turn column names into index labels.
  30. pd.DataFrame.unstack() | turn index values into column names.
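
A handful of the functions above, strung together as a minimal sketch. The data, the variable names and the second dataframe other are all made up for illustration, and this is only one way to combine these calls.

```python
import pandas as pd

# Hypothetical data: one row per day, two integer columns.
dates = pd.date_range('2017-06-21', '2017-06-27')
df = pd.DataFrame({'a': [5, 3, 8, 1, 9, 2, 7],
                   'b': [0, 4, 4, 6, 1, 3, 2]}, index=dates)

first_two = df.head(2)                              # first two rows
by_position = df.iloc[0]                            # first row, by integer position
by_label = df.loc[pd.Timestamp('2017-06-23'), 'a']  # one cell, by index label and column name
sorted_df = df.sort_values('a')                     # rows ordered by column a
mask = df.isin([4])                                 # elementwise True/False
peak_day = df['a'].idxmax()                         # index label of the largest a

# apply: run a function over each column (axis=0).
spread = df.apply(lambda col: col.max() - col.min(), axis=0)

# resample: bin the daily rows into weeks (ending on Sunday) and sum each bin.
weekly = df.resample('W').sum()

# merge: inner join on the shared column a.
other = pd.DataFrame({'a': [1, 9], 'label': ['low', 'high']})
merged = df.merge(other, on='a', how='inner')

# stack turns the column names into an inner index level; unstack reverses it.
long = df.stack()
wide = long.unstack()

print(spread)
print(weekly)
print(merged)
```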

Groupby methods

We turn next to the Groupby methods. A useful family, these ones.

To group a dataframe by a column (or columns), use pd.DataFrame.groupby('colname'). This returns a DataFrameGroupBy object, on which you can call a certain set of methods.

So! say gb is a DataFrameGroupBy object, obtained faithfully from pd.DataFrame.groupby().

There are some very useful functions you can use: sum, min, max, mean, median and std. Hardworking citizens of the data science empire, those guys.
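
A minimal sketch of a few of them at work. The dataframe and its column names (city, sales) are invented for the example.

```python
import pandas as pd

# Hypothetical sales data; the column names are made up for illustration.
df = pd.DataFrame({'city': ['Oslo', 'Oslo', 'Lima', 'Lima', 'Lima'],
                   'sales': [10, 20, 5, 15, 40]})

gb = df.groupby('city')   # a DataFrameGroupBy object

totals = gb.sum()         # one row per city, sales summed
means = gb.mean()         # one row per city, sales averaged
largest = gb.max()        # one row per city, largest sale

print(totals)
print(means)
print(largest)
```

Each result is a new dataframe indexed by the grouping column, so totals.loc['Lima', 'sales'] picks out a single group's value.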

More useful methods:

String methods

The Pandas library has a set of methods for string manipulation and string handling. They operate on Series objects and live under the accessor pd.Series.str. Don’t confuse it with Python’s native str. A false friend, that one.

Again! let s be a pd.Series of strings. Then you could do
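
For instance, a minimal sketch with a made-up series. The sample strings are assumptions, and the methods shown are only a handful of what the accessor offers.

```python
import pandas as pd

# Hypothetical series of messy strings.
s = pd.Series(['  Apple', 'banana ', 'Cherry pie'])

cleaned = s.str.strip()                     # trim leading and trailing whitespace
lowered = cleaned.str.lower()               # lowercase every element
has_an = lowered.str.contains('an')         # boolean series: does the element contain 'an'?
lengths = cleaned.str.len()                 # number of characters per element
first_word = cleaned.str.split(' ').str[0]  # first token of each element

print(cleaned.tolist())
print(has_an.tolist())
print(first_word.tolist())
```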

Miscellaneous functions

You want more! Okay then.

Final words

This is by no means complete; nor does it pretend to be complete.

It’s just a list of functions. No more, no less.
