
Unexpected performance degradation with multiple Go routines


Hello,
I’m wondering why my algorithm for generating data doesn’t perform better when two Go routines are started to generate the data.
I ran the test on a multi-core machine, meaning that the two Go routines can run in parallel. Each of these two Go routines only generates half of the total amount of data to be generated. I therefore assumed that the total running time should be about half as long compared to a single Go routine.
It would be great if somebody could review the attached code sample below. There might be something wrong with my approach, but I couldn’t identify it.
Thanks!

package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

func gendata(numGoRoutines, numElements int) []int {
	var wg sync.WaitGroup
	wg.Add(numGoRoutines)
	rand.Seed(time.Now().UnixNano())

	ssi := make([][]int, numGoRoutines)
	for i := 0; i < numGoRoutines; i++ {
		ssi[i] = make([]int, 0, numElements/numGoRoutines)
		go func(i int) {
			defer wg.Done()
			for j := 0; j < numElements/numGoRoutines; j++ {
				ssi[i] = append(ssi[i], rand.Intn(10))
			}
		}(i)
	}
	wg.Wait()

	// Create the result by appending all the sub-slices into one result slice
	var res []int
	for i := 0; i < numGoRoutines; i++ {
		res = append(res, ssi[i]...)
	}
	return res
}

func main() {
	n := 100000000

	// One Go routine
	g := 1
	t1 := time.Now()
	gendata(g, n)
	t2 := time.Now()
	fmt.Println("Knowledge era of", n, "components with", g, "Go routines took", t2.Sub(t1))

	// Two Go routines
	g = 2
	t1 = time.Now()
	gendata(g, n)
	t2 = time.Now()
	fmt.Println("Knowledge era of", n, "components with", g, "Go routines took", t2.Sub(t1))
}

math/rand produces a predictable sequence of pseudo-random values. To ensure that successive calls to math/rand.Intn always produce the same values, the source is protected with a mutex, so adding goroutines won’t help.

Instead, use math/rand.NewSource and pass that to rand.New for each goroutine individually. You cannot share a rand.Source between goroutines, because only the global source is safe for concurrent use. That should allow multiple cores to generate values.
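
A minimal sketch of the goroutine body with that change (the + int64(i) in the seed is only there to keep the two goroutines from seeding identically; it is an assumption, not part of the suggestion above):

		go func(i int) {
			defer wg.Done()
			// Per-goroutine source: no contention on the global rand mutex.
			r := rand.New(rand.NewSource(time.Now().UnixNano() + int64(i)))
			for j := 0; j < numElements/numGoRoutines; j++ {
				ssi[i] = append(ssi[i], r.Intn(10))
			}
		}(i)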


That’s a good point. It improved the performance. Now, the runtime for the data generation by one goroutine and by two goroutines is almost equal.

However, I expected the runtime to be roughly halved with the two-goroutine approach.
I also tested with a fixed value for the data to be appended to the slice, e.g.

ssi[i] = append(ssi[i], 1)

The result is the same. The runtime is the same for both variants. It seems there is something else that influences the runtime of the two goroutines. I still wonder why this happens…

I recommend looking into the runtime/pprof package to investigate where the time is being spent.
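
For a CPU profile, a minimal sketch that could be dropped into the top of main (the file name cpu.prof is an arbitrary choice; it assumes the additional imports "log", "os", and "runtime/pprof"):

	// Profile the whole run; inspect afterwards with `go tool pprof cpu.prof`.
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

The top command inside go tool pprof then shows which functions dominate the runtime.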

I did more research and was able to further improve the runtime of the function.
In addition to the change regarding the random source (thanks for that hint), I’ve changed the following two things:

  1. The calculation of the exit criterion of the for-loop is performed only once before entering the loop
  2. New elements are inserted into the slice using the appropriate slice indices
func gendata(numGoRoutines, numElements int) []int {
	var wg sync.WaitGroup
	wg.Add(numGoRoutines)

	ssi := make([][]int, numGoRoutines)
	for i := 0; i < numGoRoutines; i++ {
		ssi[i] = make([]int, numElements/numGoRoutines)
		go func(i int) {
			defer wg.Done()
			s := rand.NewSource(time.Now().UnixNano())
			r := rand.New(s)
			cond := numElements / numGoRoutines
			for j := 0; j < cond; j++ {
				ssi[i][j] = r.Intn(10)
			}
		}(i)
	}
	wg.Wait()

	// Create the result by appending all the sub-slices into one result slice
	var res []int
	for i := 0; i < numGoRoutines; i++ {
		res = append(res, ssi[i]...)
	}
	return res
}

Of course! You’re right; I’m annoyed at myself for not noticing this! Every time you append to the slice, you’re rewriting the new length back into ssi[i], which is adjacent in memory to ssi[i±1], which likely results in false sharing and can nullify any benefit of parallelism.


So every single access to ssi causes a cache miss. How does this nullify the benefits of parallelism, when one goroutine or many goroutines all have the same number of cache misses?

@sabitor the variable name ssi is very poorly chosen. Give it a meaningful name, and it would improve readability a lot.

No, not a cache miss: The ssi[0] and ssi[1] slice headers are adjacent in memory, and every time you append to the slice, the length field will be written to and will “clobber” the cache line that both slice headers live on, so that other cores have to keep reloading their copies of the slice from main memory.

If you preallocate the slices so that you’re writing into their backing arrays, but not to the actual slice headers, each core is writing to its own memory in different cache lines, so you don’t have to deal with the false-sharing situation.
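
If one wanted to keep using append, a sketch of an alternative (not from this thread) would be to let each goroutine append to a local variable and write the slice header back to ssi[i] only once, so the shared cache line is touched a single time:

		go func(i int) {
			defer wg.Done()
			r := rand.New(rand.NewSource(time.Now().UnixNano() + int64(i)))
			// Appending to a local header avoids rewriting ssi[i]
			// (and its cache line) on every iteration.
			local := ssi[i]
			for j := 0; j < numElements/numGoRoutines; j++ {
				local = append(local, r.Intn(10))
			}
			ssi[i] = local // one write to the shared header at the end
		}(i)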

Why would the other goroutines keep reloading? How does that work in practice?
Assume goroutine A modifies a slice X and writes that to slow memory. That slice’s len and cap may also change.

Then another goroutine B wants to do something with another slice Y in the same cache line. How does that work? The whole machine will need to read the cache line once B wants to change Y’s len or cap (before it wants to modify it). All goroutines share the same cache.

What do you mean by reload? The word reload seems confusing. Once a goroutine B reads the cap/len/pointer of slice Y, that’s it. It will not re-read these (and the cache line) until the next time it executes another instruction reading anything from that cache line. That is a simple cache miss (the cache line that B had was invalidated precisely when A modified X).

There is no mechanism pushing information to B (reload); B merely causes the machine to pull the cache line again (cache miss).

Am I correct? What am I missing? It seems I don’t understand how invalidating a cache line works with regard to the goroutines sharing a common single copy of it, my assumption being that they will only read it again (only once) when they want data from it.
