Introduction
One of the many reasons I like using pandas instead of Excel for data analysis is that it is easier to avoid certain types of copy-paste Excel errors. As great as pandas is, there is still plenty of opportunity to make mistakes with pandas code. This article discusses a subtle issue with pandas groupby code that can lead to big errors if you're not careful. I'm writing this because I have run into it in the past, but it still bit me big time just recently. I hope this article can help a few of you avoid the same mistake.
The Problem
To illustrate the problem, we'll use a simple data set that shows sales for 20 customers and includes their region and an internal customer segment designation of Platinum, Gold or Silver. Here is the full data set:
 | Customer ID | Customer Name | Region | Segment | Sales
---|---|---|---|---|---
0 | 740150 | Barton LLC | US | Gold | 215000
1 | 714466 | Trantow-Barrows | EMEA | Silver | 430000
2 | 218895 | Kulas Inc | EMEA | Platinum | 410000
3 | 307599 | Kassulke, Ondricka and Metz | EMEA | NaN | 91000
4 | 412290 | Jerde-Hilpert | EMEA | Gold | 630000
5 | 729833 | Koepp Ltd | US | NaN | 230000
6 | 737550 | Fritsch, Russel and Anderson | US | Gold | 630000
7 | 146832 | Kiehn-Spinka | US | Silver | 615000
8 | 688981 | Keeling LLC | US | Platinum | 515000
9 | 786968 | Frami, Hills and Schmidt | US | Gold | 215000
10 | 239344 | Stokes LLC | US | Silver | 230000
11 | 672390 | Kuhn-Gusikowski | APAC | Platinum | 630000
12 | 141962 | Herman LLC | APAC | Gold | 215000
13 | 424914 | White-Trantow | US | NaN | 230000
14 | 527099 | Sanford and Sons | US | Platinum | 615000
15 | 642753 | Pollich LLC | US | Gold | 419000
16 | 383080 | Will LLC | US | Silver | 415000
17 | 257198 | Cronin, Oberbrunner and Spencer | US | Platinum | 425000
18 | 604255 | Halvorson, Crona and Champlin | US | NaN | 430000
19 | 163416 | Purdy-Kunde | APAC | Silver | 410000
The data looks pretty simple. There is only one numeric column, so let's see what it totals to.
import pandas as pd
df = pd.read_excel('https://github.com/chris1610/pbpython/raw/master/data/sales_9_2022.xlsx')
df["Sales"].sum()
We have $8,000,000 in sales in the spreadsheet. Keep that number in mind.
Let's do some simple analysis to summarize sales by region:
df.groupby(['Region']).agg({'Sales': 'sum'})
 | Sales
---|---
Region |
APAC | 1255000
EMEA | 1561000
US | 5184000
We can double-check the math:
df.groupby(['Region']).agg({'Sales': 'sum'}).sum()
Sales 8000000
dtype: int64
Looks good. That's what we expect. Now let's see what sales look like by Segment:
df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'})
Which yields this table:
 | | Sales
---|---|---
Region | Segment |
APAC | Gold | 215000
 | Platinum | 630000
 | Silver | 410000
EMEA | Gold | 630000
 | Platinum | 410000
 | Silver | 430000
US | Gold | 1479000
 | Platinum | 1555000
 | Silver | 1260000
This looks good. No errors and the table seems reasonable. We should continue our analysis, right?
Nope. There is a potentially subtle issue here. Let's sum the data to double-check:
df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'}).sum()
Sales 7019000
dtype: int64
This only includes $7,019,000. Where did the other $981,000 go? Is pandas broken?
You can see the issue clearly if we use the dropna=False parameter to explicitly include NaN values in our results:
df.groupby(['Region', 'Segment'], dropna=False).agg({'Sales': 'sum'})
Now we can see the NaN combinations within the EMEA and US groupings:
 | | Sales
---|---|---
Region | Segment |
APAC | Gold | 215000
 | Platinum | 630000
 | Silver | 410000
EMEA | Gold | 630000
 | Platinum | 410000
 | Silver | 430000
 | NaN | 91000
US | Gold | 1479000
 | Platinum | 1555000
 | Silver | 1260000
 | NaN | 890000
If we check the sum, we can see it totals to $8M.
df.groupby(['Region', 'Segment'], dropna=False).agg({'Sales': 'sum'}).sum()
Sales 8000000
dtype: int64
The pandas documentation is very clear on this:
dropna: bool, default True
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
The takeaway is that if your groupby columns contain any NA values, you need to make a conscious decision about whether or not you want to include those values in the grouped results.
If you are okay with dropping those values, use the default dropna=True.
However, if you want to make sure that all values (Sales in this particular case) are included, then make sure to use dropna=False in your groupby call.
An ounce of prevention
The main way to guard against this potential issue is to understand whether you have any NaN values in your data. There are a couple of ways to do that.
You can use pure pandas:
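# One straightforward option is isnull().sum(), which counts the missing values in each column
df.isnull().sum()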
Customer ID      0
Customer Name    0
Region           0
Segment          4
Sales            0
dtype: int64
There are other tools like missingno, which provide a more robust interface for exploring the data.
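As a minimal sketch, assuming missingno is installed, a quick visual check might look something like this:
import missingno as msno

# Draw a matrix plot that shows where values are missing in each column
msno.matrix(df)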
I'm a fan of sidetable. Here's how to use it after it's installed and imported:
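import sidetable

# sidetable registers the .stb accessor; missing() summarizes missing values per column
df.stb.missing()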
 | missing | total | percent
---|---|---|---
Segment | 4 | 20 | 20.0
Customer ID | 0 | 20 | 0.0
Customer Name | 0 | 20 | 0.0
Region | 0 | 20 | 0.0
Sales | 0 | 20 | 0.0
Regardless of the approach you use, it's worth remembering that you need to know whether you have any null or NaN values in your data, and how you want to handle them in your analysis.
The other alternative to using the dropna parameter is to explicitly fill in the missing values using fillna:
df.fillna('unknown').groupby(['Region', 'Segment']).agg({'Sales': 'sum'})
Now the unknown values are explicitly called out:
 | | Sales
---|---|---
Region | Segment |
APAC | Gold | 215000
 | Platinum | 630000
 | Silver | 410000
EMEA | Gold | 630000
 | Platinum | 410000
 | Silver | 430000
 | unknown | 91000
US | Gold | 1479000
 | Platinum | 1555000
 | Silver | 1260000
 | unknown | 890000
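As a small variation on the example above, if you would rather not fill NaN values in every column, fillna also accepts a dictionary so you can target just the groupby key:
# Fill missing values only in the Segment column, leaving the other columns untouched
df.fillna({'Segment': 'unknown'}).groupby(['Region', 'Segment']).agg({'Sales': 'sum'})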
Conclusion
When working with pandas groupby, the results can be surprising if you have NaN values in your dataframe columns. The default behavior is to drop those values, which means you can effectively "lose" some of your data during the process.
I have been bitten by this behavior several times in the past. In some cases it might not be a big deal. In others, you might need to sheepishly explain why your numbers aren't adding up.
Have you seen this before? Let me know in the comments below.