Introduction
One of the many reasons I like using pandas instead of Excel for data analysis is that it is easier to avoid certain types of copy-paste Excel errors. As great as pandas is, there is still plenty of opportunity to make mistakes with pandas code. This article discusses a subtle issue with pandas groupby code that can lead to big errors if you're not careful. I'm writing this because I have run into it in the past, but it still bit me big time just recently. I hope this article can help a few of you avoid the same mistake.
The Problem
To illustrate the problem, we'll use a simple data set that shows sales for 20 customers and includes their region and an internal customer segment designation of Platinum, Gold or Silver. Here is the full data set:
 | Customer ID | Customer Name | Region | Segment | Sales
---|---|---|---|---|---
0 | 740150 | Barton LLC | US | Gold | 215000
1 | 714466 | Trantow-Barrows | EMEA | Silver | 430000
2 | 218895 | Kulas Inc | EMEA | Platinum | 410000
3 | 307599 | Kassulke, Ondricka and Metz | EMEA | NaN | 91000
4 | 412290 | Jerde-Hilpert | EMEA | Gold | 630000
5 | 729833 | Koepp Ltd | US | NaN | 230000
6 | 737550 | Fritsch, Russel and Anderson | US | Gold | 630000
7 | 146832 | Kiehn-Spinka | US | Silver | 615000
8 | 688981 | Keeling LLC | US | Platinum | 515000
9 | 786968 | Frami, Hills and Schmidt | US | Gold | 215000
10 | 239344 | Stokes LLC | US | Silver | 230000
11 | 672390 | Kuhn-Gusikowski | APAC | Platinum | 630000
12 | 141962 | Herman LLC | APAC | Gold | 215000
13 | 424914 | White-Trantow | US | NaN | 230000
14 | 527099 | Sanford and Sons | US | Platinum | 615000
15 | 642753 | Pollich LLC | US | Gold | 419000
16 | 383080 | Will LLC | US | Silver | 415000
17 | 257198 | Cronin, Oberbrunner and Spencer | US | Platinum | 425000
18 | 604255 | Halvorson, Crona and Champlin | US | NaN | 430000
19 | 163416 | Purdy-Kunde | APAC | Silver | 410000
The data looks pretty simple. There is only one numeric column, so let's see what it totals to.
import pandas as pd
df = pd.read_excel('https://github.com/chris1610/pbpython/raw/master/data/sales_9_2022.xlsx')
df["Sales"].sum()
We have $8,000,000 in sales in the spreadsheet. Keep that number in mind.
Let's do some simple analysis to summarize sales by region:
df.groupby(['Region']).agg({'Sales': 'sum'})
 | Sales
---|---
Region |
APAC | 1255000
EMEA | 1561000
US | 5184000
We can double-check the math:
df.groupby(['Region']).agg({'Sales': 'sum'}).sum()
Sales 8000000
dtype: int64
Looks good. That's what we expect. Now let's see what sales look like by Segment:
df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'})
Which yields this table:
 | | Sales
---|---|---
Region | Segment |
APAC | Gold | 215000
 | Platinum | 630000
 | Silver | 410000
EMEA | Gold | 630000
 | Platinum | 410000
 | Silver | 430000
US | Gold | 1479000
 | Platinum | 1555000
 | Silver | 1260000
This looks good. No errors and the table seems reasonable. We should continue our analysis, right?
Nope. There is a potentially subtle issue here. Let's sum the data to double-check:
df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'}).sum()
Sales 7019000
dtype: int64
This only includes $7,019,000. Where did the other $981,000 go? Is pandas broken?
You can see the issue clearly if we use the dropna=False parameter to explicitly include NaN values in our results:
df.groupby(['Region', 'Segment'], dropna=False).agg({'Sales': 'sum'})
Now we can see the NaN combinations within the EMEA and US groupings:
 | | Sales
---|---|---
Region | Segment |
APAC | Gold | 215000
 | Platinum | 630000
 | Silver | 410000
EMEA | Gold | 630000
 | Platinum | 410000
 | Silver | 430000
 | NaN | 91000
US | Gold | 1479000
 | Platinum | 1555000
 | Silver | 1260000
 | NaN | 890000
If we check the sum, we can see it totals to $8M.
df.groupby(['Region', 'Segment'], dropna=False).agg({'Sales': 'sum'}).sum()
Sales 8000000
dtype: int64
The pandas documentation is very clear on this:
dropna: bool, default True
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
The takeaway is that if your groupby columns contain any NA values, you need to make a conscious decision about whether or not you want to include those values in the grouped results.
If you are okay with dropping those values, use the default dropna=True.
However, if you want to make sure that all values (Sales in this particular case) are included, then make sure to use dropna=False in your groupby call.
An ounce of prevention
The main way to guard against this potential issue is to understand whether you have any NaN values in your data. There are a couple of ways to do that.
You can use pure pandas:
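# One straightforward option is isnull().sum(), which counts the missing values in each column
df.isnull().sum()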
Customer ID      0
Customer Name    0
Region           0
Segment          4
Sales            0
dtype: int64
There are other tools like missingno, which provide a more robust interface for exploring the data.
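As a minimal sketch, assuming missingno is installed, a quick visual check might look something like this:
import missingno as msno

# Draw a matrix plot that shows where values are missing in each column
msno.matrix(df)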
I'm a fan of sidetable. Here's how to use it after it's installed and imported:
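import sidetable

# sidetable registers the .stb accessor; missing() summarizes missing values per column
df.stb.missing()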
 | missing | total | percent
---|---|---|---
Segment | 4 | 20 | 20.0
Customer ID | 0 | 20 | 0.0
Customer Name | 0 | 20 | 0.0
Region | 0 | 20 | 0.0
Sales | 0 | 20 | 0.0
Regardless of the approach you use, it's worth remembering that you need to know whether you have any null or NaN values in your data, and how you want to handle them in your analysis.
The other alternative to using the dropna parameter is to explicitly fill in the missing values using fillna:
df.fillna('unknown').groupby(['Region', 'Segment']).agg({'Sales': 'sum'})
Now the unknown values are explicitly called out:
 | | Sales
---|---|---
Region | Segment |
APAC | Gold | 215000
 | Platinum | 630000
 | Silver | 410000
EMEA | Gold | 630000
 | Platinum | 410000
 | Silver | 430000
 | unknown | 91000
US | Gold | 1479000
 | Platinum | 1555000
 | Silver | 1260000
 | unknown | 890000
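As a small variation on the example above, if you would rather not fill NaN values in every column, fillna also accepts a dictionary so you can target just the groupby key:
# Fill missing values only in the Segment column, leaving the other columns untouched
df.fillna({'Segment': 'unknown'}).groupby(['Region', 'Segment']).agg({'Sales': 'sum'})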
Conclusion
When working with pandas groupby, the results can be surprising if you have NaN values in your dataframe columns. The default behavior is to drop those values, which means you can effectively "lose" some of your data during the process.
I have been bitten by this behavior several times in the past. In some cases it might not be a big deal. In others, you might need to sheepishly explain why your numbers aren't adding up.
Have you seen this before? Let me know in the comments below.