Introduction

In this article, I will walk you through on how to import csv data into pandas and start exploring it.

import pandas module

We start with importing the pandas module.

import pandas as pd

Introudce dataset

Next, we will read a csv file using pandas. It’s a csv file containing columns as Rank, country, gold, silver, bronze and total medals

Data

This is the data we are going to be using for this lesson. The data we are using is from Rio Olympics The data is a list of countries and the medals they won.

Import dataset

To read csv data into pandas, we will use pandas read_csv method The read data will be read in a pandas dataframe object.

df = pd.read_csv('olympics.csv')

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table. It is the most commonly used pandas object.

To print the data inside the dataframe, we just call it’s name, like this

df
Rank Country Gold Silver Bronze Total
0 1 United States (USA) 46 37 38 121
1 2 Great Britain (GBR) 27 23 17 67
2 3 China (CHN) 26 18 26 70
3 4 Russia (RUS) 19 17 19 55
4 5 Germany (GER) 17 10 15 42
5 6 Japan (JPN) 12 8 21 41
6 7 France (FRA) 10 18 14 42
7 8 South Korea (KOR) 9 3 9 21
8 9 Italy (ITA) 8 12 8 28
9 10 Australia (AUS) 8 11 10 29
10 11 Netherlands (NED) 8 7 4 19
11 12 Hungary (HUN) 8 3 4 15
12 13 Brazil (BRA)* 7 6 6 19
13 14 Spain (ESP) 7 4 6 17
14 15 Kenya (KEN) 6 6 1 13
15 16 Jamaica (JAM) 6 3 2 11
16 17 Croatia (CRO) 5 3 2 10
17 18 Cuba (CUB) 5 2 4 11
18 19 New Zealand (NZL) 4 9 5 18
19 20 Canada (CAN) 4 3 15 22
20 21 Uzbekistan (UZB) 4 2 7 13
21 22 Kazakhstan (KAZ) 3 5 9 17
22 23 Colombia (COL) 3 2 3 8
23 24 Switzerland (SUI) 3 2 2 7
24 25 Iran (IRI) 3 1 4 8
25 26 Greece (GRE) 3 1 2 6
26 27 Argentina (ARG) 3 1 0 4
27 28 Denmark (DEN) 2 6 7 15
28 29 Sweden (SWE) 2 6 3 11
29 30 South Africa (RSA) 2 6 2 10
... ... ... ... ... ... ...
57 54 Singapore (SIN) 1 0 0 1
58 54 Tajikistan (TJK) 1 0 0 1
59 60 Malaysia (MAS) 0 4 1 5
60 61 Mexico (MEX) 0 3 2 5
61 62 Algeria (ALG) 0 2 0 2
62 62 Ireland (IRL) 0 2 0 2
63 64 Lithuania (LTU) 0 1 3 4
64 65 Bulgaria (BUL) 0 1 2 3
65 65 Venezuela (VEN) 0 1 2 3
66 67 India (IND) 0 1 1 2
67 67 Mongolia (MGL) 0 1 1 2
68 69 Burundi (BDI) 0 1 0 1
69 69 Grenada (GRN) 0 1 0 1
70 69 Niger (NIG) 0 1 0 1
71 69 Philippines (PHI) 0 1 0 1
72 69 Qatar (QAT) 0 1 0 1
73 74 Norway (NOR) 0 0 4 4
74 75 Egypt (EGY) 0 0 3 3
75 75 Tunisia (TUN) 0 0 3 3
76 77 Israel (ISR) 0 0 2 2
77 78 Austria (AUT) 0 0 1 1
78 78 Dominican Republic (DOM) 0 0 1 1
79 78 Estonia (EST) 0 0 1 1
80 78 Finland (FIN) 0 0 1 1
81 78 Morocco (MAR) 0 0 1 1
82 78 Moldova (MDA) 0 0 1 1
83 78 Nigeria (NGR) 0 0 1 1
84 78 Portugal (POR) 0 0 1 1
85 78 Trinidad and Tobago (TTO) 0 0 1 1
86 78 United Arab Emirates (UAE) 0 0 1 1

87 rows × 6 columns

As you can see, our csv data is now a 2D data structure and each individual element can be accessed directly

If no index is specified, pandas automatically assigns a index, starting from 0. However, in our case, we may want to specify country as the index. In that case we have to pass a extra option to the read_csv table like this.

df = pd.read_csv('olympics.csv', index_col='Country')
df
Rank Gold Silver Bronze Total
Country
United States (USA) 1 46 37 38 121
Great Britain (GBR) 2 27 23 17 67
China (CHN) 3 26 18 26 70
Russia (RUS) 4 19 17 19 55
Germany (GER) 5 17 10 15 42
Japan (JPN) 6 12 8 21 41
France (FRA) 7 10 18 14 42
South Korea (KOR) 8 9 3 9 21
Italy (ITA) 9 8 12 8 28
Australia (AUS) 10 8 11 10 29
Netherlands (NED) 11 8 7 4 19
Hungary (HUN) 12 8 3 4 15
Brazil (BRA)* 13 7 6 6 19
Spain (ESP) 14 7 4 6 17
Kenya (KEN) 15 6 6 1 13
Jamaica (JAM) 16 6 3 2 11
Croatia (CRO) 17 5 3 2 10
Cuba (CUB) 18 5 2 4 11
New Zealand (NZL) 19 4 9 5 18
Canada (CAN) 20 4 3 15 22
Uzbekistan (UZB) 21 4 2 7 13
Kazakhstan (KAZ) 22 3 5 9 17
Colombia (COL) 23 3 2 3 8
Switzerland (SUI) 24 3 2 2 7
Iran (IRI) 25 3 1 4 8
Greece (GRE) 26 3 1 2 6
Argentina (ARG) 27 3 1 0 4
Denmark (DEN) 28 2 6 7 15
Sweden (SWE) 29 2 6 3 11
South Africa (RSA) 30 2 6 2 10
... ... ... ... ... ...
Singapore (SIN) 54 1 0 0 1
Tajikistan (TJK) 54 1 0 0 1
Malaysia (MAS) 60 0 4 1 5
Mexico (MEX) 61 0 3 2 5
Algeria (ALG) 62 0 2 0 2
Ireland (IRL) 62 0 2 0 2
Lithuania (LTU) 64 0 1 3 4
Bulgaria (BUL) 65 0 1 2 3
Venezuela (VEN) 65 0 1 2 3
India (IND) 67 0 1 1 2
Mongolia (MGL) 67 0 1 1 2
Burundi (BDI) 69 0 1 0 1
Grenada (GRN) 69 0 1 0 1
Niger (NIG) 69 0 1 0 1
Philippines (PHI) 69 0 1 0 1
Qatar (QAT) 69 0 1 0 1
Norway (NOR) 74 0 0 4 4
Egypt (EGY) 75 0 0 3 3
Tunisia (TUN) 75 0 0 3 3
Israel (ISR) 77 0 0 2 2
Austria (AUT) 78 0 0 1 1
Dominican Republic (DOM) 78 0 0 1 1
Estonia (EST) 78 0 0 1 1
Finland (FIN) 78 0 0 1 1
Morocco (MAR) 78 0 0 1 1
Moldova (MDA) 78 0 0 1 1
Nigeria (NGR) 78 0 0 1 1
Portugal (POR) 78 0 0 1 1
Trinidad and Tobago (TTO) 78 0 0 1 1
United Arab Emirates (UAE) 78 0 0 1 1

87 rows × 5 columns

Now, we have Country name as index for our dataframe.

To check, what are the datatypes of the various columns, we can call dtypes on the dataframe we just created

df.dtypes
Rank      int64
Gold      int64
Silver    int64
Bronze    int64
Total     int64
dtype: object

To see the top rows of the dataframe, use the head method

df.head()
Rank Gold Silver Bronze Total
Country
United States (USA) 1 46 37 38 121
Great Britain (GBR) 2 27 23 17 67
China (CHN) 3 26 18 26 70
Russia (RUS) 4 19 17 19 55
Germany (GER) 5 17 10 15 42

To see the bottom rows of the dataframe, use tail method

df.tail()
Rank Gold Silver Bronze Total
Country
Moldova (MDA) 78 0 0 1 1
Nigeria (NGR) 78 0 0 1 1
Portugal (POR) 78 0 0 1 1
Trinidad and Tobago (TTO) 78 0 0 1 1
United Arab Emirates (UAE) 78 0 0 1 1

Calling columns method on the dataframe will print just the column names

df.columns
Index(['Rank', 'Gold', 'Silver', 'Bronze', 'Total'], dtype='object')

To see a quick statistical summary of the your data, you can call describe method.

df.describe()
Rank Gold Silver Bronze Total
count 87.000000 87.000000 87.000000 87.000000 87.000000
mean 43.091954 3.528736 3.517241 4.126437 11.172414
std 24.193326 6.867245 5.726036 6.263227 18.213340
min 1.000000 0.000000 0.000000 0.000000 1.000000
25% 22.500000 0.000000 0.000000 1.000000 2.000000
50% 44.000000 1.000000 2.000000 2.000000 4.000000
75% 65.000000 3.000000 4.000000 4.000000 11.000000
max 78.000000 46.000000 37.000000 38.000000 121.000000

To get a single column of data or a subset of data, you need to pass on that column to the dataframe For example, to print just country names, we will say df and pass Country in square brackets

df['Total']
Country
United States (USA)           121
Great Britain (GBR)            67
China (CHN)                    70
Russia (RUS)                   55
Germany (GER)                  42
Japan (JPN)                    41
France (FRA)                   42
South Korea (KOR)              21
Italy (ITA)                    28
Australia (AUS)                29
Netherlands (NED)              19
Hungary (HUN)                  15
Brazil (BRA)*                  19
Spain (ESP)                    17
Kenya (KEN)                    13
Jamaica (JAM)                  11
Croatia (CRO)                  10
Cuba (CUB)                     11
New Zealand (NZL)              18
Canada (CAN)                   22
Uzbekistan (UZB)               13
Kazakhstan (KAZ)               17
Colombia (COL)                  8
Switzerland (SUI)               7
Iran (IRI)                      8
Greece (GRE)                    6
Argentina (ARG)                 4
Denmark (DEN)                  15
Sweden (SWE)                   11
South Africa (RSA)             10
                             ... 
Singapore (SIN)                 1
Tajikistan (TJK)                1
Malaysia (MAS)                  5
Mexico (MEX)                    5
Algeria (ALG)                   2
Ireland (IRL)                   2
Lithuania (LTU)                 4
Bulgaria (BUL)                  3
Venezuela (VEN)                 3
India (IND)                     2
Mongolia (MGL)                  2
Burundi (BDI)                   1
Grenada (GRN)                   1
Niger (NIG)                     1
Philippines (PHI)               1
Qatar (QAT)                     1
Norway (NOR)                    4
Egypt (EGY)                     3
Tunisia (TUN)                   3
Israel (ISR)                    2
Austria (AUT)                   1
Dominican Republic (DOM)        1
Estonia (EST)                   1
Finland (FIN)                   1
Morocco (MAR)                   1
Moldova (MDA)                   1
Nigeria (NGR)                   1
Portugal (POR)                  1
Trinidad and Tobago (TTO)       1
United Arab Emirates (UAE)      1
Name: Total, dtype: int64

To select only a certain rows, you have pass the index of those rows to the dataframe

df[2:6]
Rank Gold Silver Bronze Total
Country
China (CHN) 3 26 18 26 70
Russia (RUS) 4 19 17 19 55
Germany (GER) 5 17 10 15 42
Japan (JPN) 6 12 8 21 41

Or if you need to select a particular row where you know the value of certain column

df[df.Rank == 1]
Rank Gold Silver Bronze Total
Country
United States (USA) 1 46 37 38 121

Or if you want to get all the data for a particular country, you can call loc method and pass country as parameter.

df.loc['China (CHN)']
Rank       3
Gold      26
Silver    18
Bronze    26
Total     70
Name: China (CHN), dtype: int64

Conclusion

In this article We learned how to import csv data and start exploring it in Python pandas. I have published a whole video course on Learning Pandas for Data Analysis, if you want to check it out.