Python and Pandas

Pandas is a package that takes away a great deal of manual coding.

Pandas can read CSV, JSON, SPSS, XML, and XLSX files into an in-memory data object called a DataFrame.
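As a minimal sketch of what reading data into a DataFrame looks like, the example below parses a small CSV and a small JSON document. The sample data and the variable names (csv_text, json_text, csv_df, json_df) are made up for illustration, and the text is read from memory via io.StringIO so that no files are needed; with real files you would pass a path instead.

```python
import io

import pandas as pd

# Hypothetical sample data, kept in memory so the sketch runs without files.
csv_text = "team,gold,silver\nA,3,1\nB,2,4\n"
json_text = '[{"team": "A", "gold": 3}, {"team": "B", "gold": 2}]'

# Both readers return a DataFrame.
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

print(type(csv_df))
print(csv_df.shape)              # (rows, columns)
print(json_df.columns.tolist())
```

The same pattern applies to the other formats, each of which has its own reader function (for example pd.read_excel for XLSX), some of which need an extra package installed.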

Depending on the source, some data cleanup may be required. You also have to tell Pandas how to deal with missing values: should they, for instance, count as zeros when calculating averages? (Usually not; by default Pandas leaves missing values out when computing a mean.) Often, after reading in a CSV, some date transformation is required as well.
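The difference between skipping missing values and treating them as zeros can be sketched as follows. The column names and values are hypothetical; the example also shows one common date transformation, converting a text column with pd.to_datetime.

```python
import io

import pandas as pd

# Hypothetical data: the second row has no score.
csv_text = "date,score\n2021-01-05,10\n2021-01-06,\n2021-01-07,20\n"
df = pd.read_csv(io.StringIO(csv_text))

# Default behaviour: the missing value is left out of the average.
mean_skipping = df["score"].mean()

# Alternative: replace missing values with zero first, which lowers the average.
mean_with_zeros = df["score"].fillna(0).mean()

print(mean_skipping, mean_with_zeros)

# Dates arrive as text; convert them to real datetime values.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
print(df["date"].dt.year.tolist())
```

Which treatment is right depends on what a missing value means in your data, so it is worth deciding explicitly rather than relying on the default.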

After tidying up the data you can manipulate the DataFrame object with very concise functions. Often you will use Pandas in combination with NumPy.
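To give an idea of how concise these operations can be, here is a small sketch that filters rows, aggregates by group, and uses a NumPy function to derive a new column. The DataFrame contents and the threshold of 14 are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data for two teams over two years.
df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "year": [2018, 2022, 2018, 2022],
    "gold": [14, 16, 14, 12],
})

# Filter rows with a boolean condition.
recent = df[df["year"] == 2022]

# Aggregate per group in one line.
totals = df.groupby("team")["gold"].sum()

# Use NumPy to derive a new column element-wise.
df["level"] = np.where(df["gold"] >= 14, "high", "low")

print(recent)
print(totals)
print(df)
```

Each of these steps would take an explicit loop in plain Python; here they are single expressions on whole columns.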

Keep in mind that Pandas holds the data in memory, and that some operations make in-memory copies of a dataset. If your data is sizable, consider working with subsets. If your data is sparse (lots of empty or zero cells), there are advanced strategies to deal with this situation. This section from the user guide may be helpful.
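Two of these strategies can be sketched briefly: storing a mostly-zero column with a sparse dtype, and reading a large CSV in chunks rather than all at once. The data here is generated for illustration, and the sizes (1000 rows, chunks of 4) are arbitrary assumptions; this is not necessarily the approach the user guide section describes.

```python
import io

import numpy as np
import pandas as pd

# A mostly-zero column: 1000 cells, only 10 of them non-zero.
values = np.zeros(1000)
values[::100] = 1.0
df = pd.DataFrame({"value": values})

# Compare memory use of the dense column and a sparse version of it.
dense_bytes = df["value"].memory_usage(index=False)
sparse = df["value"].astype(pd.SparseDtype("float64", fill_value=0.0))
sparse_bytes = sparse.memory_usage(index=False)
print(dense_bytes, sparse_bytes)

# Working with subsets: read a CSV a few rows at a time.
csv_text = "x\n" + "\n".join(str(i) for i in range(10))
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["x"].sum()  # each chunk is a small DataFrame
print(total)
```

Chunked reading only helps when the calculation can be done piece by piece, as with the running sum above.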


Loading a CSV file

import pandas as pd
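df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
# skiprows=1 skips the first line of the file, for example a line that
# sits above the real column names; the next line is then used as the header
# index_col=0 uses the first column as the row index
# df is of type DataFrame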
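To see what index_col and skiprows do in the call above, the following sketch runs the same call on a few lines of made-up text instead of olympics.csv. The file layout (one title line, then the header row, then data) is an assumption chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical file contents: a title line, the header row, then the data.
raw = "Medal table (illustrative)\nteam,gold,silver\nA,3,1\nB,2,4\n"

df = pd.read_csv(io.StringIO(raw), index_col=0, skiprows=1)

print(df.index.tolist())    # ['A', 'B']
print(df.columns.tolist())  # ['gold', 'silver']
```

The title line is dropped, the second line supplies the column names, and the first column becomes the row index rather than a regular column.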
