
Pandas Groupby Warning – Practical Business Python


Introduction

One of the many reasons I like using pandas instead of Excel for data analysis is that it’s
easier to avoid certain kinds of copy-paste Excel errors. As great as pandas is, there
is still plenty of opportunity to make mistakes with pandas code. This article discusses a
subtle issue with pandas groupby code that can lead to big errors if you’re not
careful. I’m writing this because I’ve stumbled on it in the past, but it still bit
me big time just recently. I hope this article can help a few of you avoid this mistake.

The Problem

To illustrate this problem, we’ll use a simple data set that shows sales for 20 customers
and includes their region and an internal customer segment designation of Platinum, Gold
or Silver. Here is the full data set:

Customer ID Customer Name Region Segment Sales
0 740150 Barton LLC US Gold 215000
1 714466 Trantow-Barrows EMEA Silver 430000
2 218895 Kulas Inc EMEA Platinum 410000
3 307599 Kassulke, Ondricka and Metz EMEA   91000
4 412290 Jerde-Hilpert EMEA Gold 630000
5 729833 Koepp Ltd US   230000
6 737550 Fritsch, Russel and Anderson US Gold 630000
7 146832 Kiehn-Spinka US Silver 615000
8 688981 Keeling LLC US Platinum 515000
9 786968 Frami, Hills and Schmidt US Gold 215000
10 239344 Stokes LLC US Silver 230000
11 672390 Kuhn-Gusikowski APAC Platinum 630000
12 141962 Herman LLC APAC Gold 215000
13 424914 White-Trantow US   230000
14 527099 Sanford and Sons US Platinum 615000
15 642753 Pollich LLC US Gold 419000
16 383080 Will LLC US Silver 415000
17 257198 Cronin, Oberbrunner and Spencer US Platinum 425000
18 604255 Halvorson, Crona and Champlin US   430000
19 163416 Purdy-Kunde APAC Silver 410000

The data looks pretty simple. There’s only one numeric column, so let’s see what it
totals to.

import pandas as pd

df = pd.read_excel('https://github.com/chris1610/pbpython/raw/master/data/sales_9_2022.xlsx')
df["Sales"].sum()

We have $8,000,000 in sales in the spreadsheet. Keep that number in mind.

Let’s do some simple analysis to summarize sales by region:

df.groupby(['Region']).agg({'Sales': 'sum'})
Sales
Region
APAC 1255000
EMEA 1561000
US 5184000

We can double check the math:

df.groupby(['Region']).agg({'Sales': 'sum'}).sum()
Sales    8000000
dtype: int64

Looks good. That’s what we expect. Now let’s see what sales look like by Segment:

df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'})

Which yields this table:

Sales
Region Segment
APAC Gold 215000
Platinum 630000
Silver 410000
EMEA Gold 630000
Platinum 410000
Silver 430000
US Gold 1479000
Platinum 1555000
Silver 1260000

This looks good. No errors, and the table seems reasonable. We should continue our analysis, right?

Nope. There’s a potentially subtle issue here. Let’s sum the data to double check:

df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'}).sum()
Sales    7019000
dtype: int64

This only includes $7,019,000. Where did the other $981,000 go? Is pandas broken?
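
A quick way to see what happened is to look at the rows where Segment is missing; their sales account for exactly the gap (a minimal check, reusing the df loaded above):

# Rows with a missing Segment are silently dropped by the default groupby
df[df["Segment"].isna()]

# Their Sales add up to the missing amount: 91000 + 230000 + 230000 + 430000 = 981000
df.loc[df["Segment"].isna(), "Sales"].sum()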

You can see the issue clearly if we use the dropna=False parameter to explicitly include NaN values in our results:

df.groupby(['Region', 'Segment'], dropna=False).agg({'Sales': 'sum'})

Now we can see the NaN combinations in the EMEA and US groupings:

Sales
Region Segment
APAC Gold 215000
Platinum 630000
Silver 410000
EMEA Gold 630000
Platinum 410000
Silver 430000
NaN 91000
US Gold 1479000
Platinum 1555000
Silver 1260000
NaN 890000

If we check the sum, we can see it totals to $8M.

df.groupby(['Region', 'Segment'], dropna=False).agg({'Sales': 'sum'}).sum()
Sales    8000000
dtype: int64

The pandas documentation is very clear on this:


dropna : bool, default True
If True, and if group keys contain NA values, NA values together with row/column will
be dropped. If False, NA values will also be treated as the key in groups.

The takeaway is that if your groupby columns contain any NA values,
then you need to make a conscious decision about whether or not you want to include those
values in the grouped results.

If you are okay dropping those values, then use the default dropna=True.

However, if you want to make sure that all values (Sales in this particular case) are included, then
make sure to use dropna=False in your groupby.
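
Whichever option you choose, a cheap safeguard is to reconcile the grouped total against the overall total before trusting the summary (a minimal sketch, reusing the df loaded above):

# Fail fast if the grouped result silently drops rows with NaN keys
total = df["Sales"].sum()
grouped_total = df.groupby(["Region", "Segment"]).agg({"Sales": "sum"})["Sales"].sum()
assert grouped_total == total, f"grouped total {grouped_total} != overall total {total}"

On this data set the assertion fails with the default dropna=True, which is exactly the kind of early warning you want.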

An ounce of prevention

The primary way to deal with this potential issue is to understand whether you have any NaN values
in your data. There are a couple of ways to do that.

You can use pure pandas to count the null values in each column:
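
# Count the NaN values in each column; the output below shows 4 missing Segment values
df.isnull().sum()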

Customer ID      0
Customer Name    0
Region           0
Segment          4
Sales            0
dtype: int64

There are other tools like missingno which provide a more robust interface for exploring
the data.
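
For instance, its matrix plot makes the gaps in each column easy to spot (a minimal sketch, assuming missingno is installed and imported as msno):

import matplotlib.pyplot as plt
import missingno as msno

# White gaps in the matrix mark the missing values in each column
msno.matrix(df)
plt.show()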

I’m a fan of sidetable. Here’s how to use it after it’s installed and imported (the summary below is the kind of missing-value report it produces):
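
import sidetable  # importing registers the .stb accessor on DataFrames

# Per-column count and percentage of missing values
df.stb.missing()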

missing total percent
Segment 4 20 20.0
Customer ID 0 20 0.0
Customer Name 0 20 0.0
Region 0 20 0.0
Sales 0 20 0.0

Regardless of the approach you use, it’s worth keeping in mind that you need to know whether
you have any null or NaN values in your data and how you want to handle them in your analysis.

The other alternative to using the dropna parameter is to explicitly fill in the missing
values using fillna:

df.fillna('unknown').groupby(['Region', 'Segment']).agg({'Sales': 'sum'})

Now the unknown values are explicitly called out:

Sales
Region Segment
APAC Gold 215000
Platinum 630000
Silver 410000
EMEA Gold 630000
Platinum 410000
Silver 430000
unknown 91000
US Gold 1479000
Platinum 1555000
Silver 1260000
unknown 890000
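
As a final check, summing the filled version confirms that nothing was dropped, since every row now has a non-null Segment:

df.fillna('unknown').groupby(['Region', 'Segment']).agg({'Sales': 'sum'}).sum()

This comes back to the full $8,000,000 we started with.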

Conclusion

When working with pandas groupby, the results can be surprising if you have NaN
values in your dataframe columns. The default behavior is to drop those values,
which means you can effectively “lose” some of your data during the process.

I’ve been bitten by this behavior a couple of times in the past. In some cases, it might not be a
big deal. In others, you might need to sheepishly explain why your numbers aren’t adding up.

Have you seen this before? Let me know in the comments below.
