Discussion:
[m-dev.] Article about memory fragmentation
Paul Bone
2016-10-08 11:32:33 UTC
This is a cross post as both communities will probably be interested.

I investigated an issue with memory fragmentation in BDW GC and Mercury and
it proved to be quite interesting, so I wrote it up as a blog post.

http://paul.bone.id.au/2016/10/08/memory-fragmentation-in-boehmgc/

I also learnt a fair amount about the collector in the process, and
explained that in the article too. It may be interesting if you're curious
about how allocation works in the collector.

Please let me know if I've gotten anything wrong, or of course if you have
any other comments.

Cheers.
--
Paul Bone
http://paul.bone.id.au
Bruce Hoult
2016-10-08 19:50:45 UTC
Good article, as far as it goes, but it just ... stops ... before the
interesting analysis.

Btw "uncorrectable" should be "uncollectable".

Almost all those block sizes have excellent utilization ratios. A moving
collector could scarcely do better!

But there's definitely room for improvement with the 32 byte size.

I'm very curious what memory allocation pattern makes you end up with so
many blocks with so few live objects.

My best guess is that at some point you actually did have 60 MB of 32 byte
objects, but then most of them became unreachable. But you're the one who
can gather the object lifetime data.

In a long-running program, you'd expect that space to get used again,
multiple times, so it's not really a problem.

But it can also be that there is a usage peak for some object size during
program startup, and then they are not needed again. Unfortunately, there's
very little that a non-moving GC can do about that.

It might be possible to do something about that at a user level though: if
there are interspersed allocations of short-lived and long-lived (or
permanent) objects of the same size, then you could pre-allocate a number of
objects (4 KB worth or some multiple) and put them in your own freelist or
some other collection, and then replace the current GC_malloc of them with
taking one from the pre-allocated list. You could do this for either the
short-lived or the long-lived objects, but I suspect that the long-lived
ones might give the easiest benefit. Get them all onto the same memory
pages, which will stay reasonably full, and let entire pages of short-lived
objects get GC'd and those pages reused for objects of other sizes.

The GC can't be some kind of oracle and predict these usage patterns, but
you might be able to.
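
To illustrate, here's a rough sketch of the long-lived case (untested, and
the Node type, the page size and the function names are all invented for
the example):

#include <gc.h>

/* Hypothetical long-lived 32-byte object. */
typedef struct Node { struct Node *next; char payload[24]; } Node;

static Node *node_pool = NULL;

/* Pre-allocate roughly one 4 KB page worth of Nodes in a single burst,
 * so they are carved out of the same block(s) rather than interspersed
 * with short-lived allocations. While they sit in node_pool they are
 * reachable, so the collector won't reclaim them. */
static void refill_pool(void) {
    size_t i, n = 4096 / sizeof(Node);
    for (i = 0; i < n; i++) {
        Node *obj = (Node *) GC_malloc(sizeof(Node));
        obj->next = node_pool;
        node_pool = obj;
    }
}

/* Use this instead of a bare GC_malloc for the long-lived objects. */
static Node *alloc_node(void) {
    Node *obj;
    if (node_pool == NULL)
        refill_pool();
    obj = node_pool;
    node_pool = obj->next;
    obj->next = NULL;
    return obj;
}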
Post by Paul Bone
This is a cross post as both communities will probably be interested.
I investigated an issue with memory fragmentation in BDW GC and Mercury and
it proved to be quite interesting, so I wrote it up as a blog post.
http://paul.bone.id.au/2016/10/08/memory-fragmentation-in-boehmgc/
I also learnt a fair amount about the collector in the process, and
explained that in the article too. It may be interesting if you're curious
about how allocation works in the collector.
Please let me know if I've gotten anything wrong, or of course if you have
any other comments.
Cheers.
--
Paul Bone
http://paul.bone.id.au
Paul Bone
2016-10-09 03:40:42 UTC
Post by Bruce Hoult
Good article, as far as it goes, but it just ... stops ... before the
interesting analysis.
Thanks. It might sound funny but I agree with you. I initially did the
investigation in June and when I found some time recently to write it up I
found I couldn't reproduce the situation. The article was written with the
data that I had collected in June.

If I were to continue, and I think I might, I would profile the same program
and show how the memory use for each object size changes over time. I agree
with you: it's very likely that we do use most of the 32 byte blocks, then
free them.
Post by Bruce Hoult
Btw "uncorrectable" should be "uncollectable".
Almost all those block sizes have excellent utilization ratios. A moving
collector could scarcely do better!
Yeah, this is why I added a note about a semi-space copying collector, since
it would be worse in terms of utilisation. It would be better for locality,
however. I'm not sure how well a mark-compact or
mark-sweep-and-sometimes-compact collector would do. I'm aware that there
will always be less-than-perfect utilisation, to give the heap space to grow
into.
Post by Bruce Hoult
But there's definitely room for improvement with the 32 byte size.
I'm very curious what memory allocation pattern makes you end up with so
many blocks with so few live objects.
My best guess is that at some point you actually did have 60 MB of 32 byte
objects, but then most of them became unreachable. But you're the one who
can gather the object lifetime data.
In a long-running program, you'd expect that space to get used again,
multiple times, so it's not really a problem.
I'll try to do that. We know that there are several stages while processing
a document where we throw away large structures that are no longer needed,
e.g. after styles have been computed but before pagination. Prince will
process a document, then terminate, and a new process is started for a new
document. It's not a long-running process, and therefore memory usage isn't
super-important, but if a client uses Prince very frequently, it can be
important.
Post by Bruce Hoult
But it can also be that there is a usage peak for some object size during
program startup, and then they are not needed again. Unfortunately, there's
very little that a non-moving GC can do about that.
We will also be able to collect data about which constructors of which types
account for this allocation size.
Post by Bruce Hoult
It might be possible to do something about that at a user level though: if
there are interspersed allocations of short-lived and long-lived (or
permanent) objects of the same size, then you could pre-allocate a number of
objects (4 KB worth or some multiple) and put them in your own freelist or
some other collection, and then replace the current GC_malloc of them with
taking one from the pre-allocated list. You could do this for either the
short-lived or the long-lived objects, but I suspect that the long-lived
ones might give the easiest benefit. Get them all onto the same memory
pages, which will stay reasonably full, and let entire pages of short-lived
objects get GC'd and those pages reused for objects of other sizes.
The GC can't be some kind of oracle and predict these usage patterns, but
you might be able to.
I'd like to check that I understand. By pre-allocating long-lived items,
but not short-lived items, we're more likely to guarantee that those objects
are allocated in the same blocks, and not interspersed among other blocks
where they would prevent those blocks from being freed when the short-lived
objects are collected?

I can see the potential, but I'd want to collect more information and
understand which objects, of these sizes, are long- or short-lived. We could
modify the profiler by adding a word to each object that points back to the
call site information (available when profiling is enabled) and increments a
counter there each time that object survives a collection. Then we would
know, for each allocation site, whether it creates long- or short-lived
objects.
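
Roughly like this (purely illustrative; these structures and names are
invented, not the real profiler's types):

/* Per-allocation-site record kept by the profiler. */
typedef struct AllocSite {
    const char    *proc_name;      /* the procedure that allocated */
    unsigned long  num_allocs;
    unsigned long  num_survivals;
} AllocSite;

/* Each profiled object would carry one extra hidden word. */
typedef struct ProfiledObj {
    struct AllocSite *site;        /* points back to the allocation site */
    /* ... the object's real fields follow ... */
} ProfiledObj;

/* Called for each object found still marked after a collection. */
static void note_survivor(ProfiledObj *obj) {
    obj->site->num_survivals++;
}

Allocation sites whose num_survivals stays near zero would be creating
short-lived objects; sites where it keeps growing create long-lived ones.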

Can you or someone else on the BDWGC list tell me in what order blocks are
put on the free lists? Do the free spots in mostly-full blocks appear
earlier on the free lists? If things are placed on the free list in the
order they are swept, which blocks are swept sooner?

Cheers.
--
Paul Bone
http://paul.bone.id.au
Bruce Hoult
2016-10-09 20:13:46 UTC
Post by Paul Bone
I'd like to check that I understand. By pre-allocating long-lived items,
but not short-lived items, we're more likely to guarantee that those objects
are allocated in the same blocks, and not interspersed among other blocks
where they would prevent those blocks from being freed when the short-lived
objects are collected?
Right.

In fact there is an existing API which would be useful for this:

void* GC_malloc_many(size_t objSz)

... which is implemented by ...

void GC_generic_malloc_many(size_t objSz, int objKind, void **result)

It might make sense to have GC_malloc_many_atomic() or the uncollectable
variants, but they don't exist at the moment and you can get the effect now
by a direct call to GC_generic_malloc_many.

This call returns some number of objects of the same size (rounded up
minimally -- just to the next multiple of 16 bytes), all of which are
located in the same gc/VM page. The first word of each object points to the
next object, and the return value is a pointer to the first object.

The function tries to find an existing block of objects of that size,
allocating an entire new block only if one can't be found. It might make
sense to add a parameter (or another function) to force always using an
entire (new or reclaimed) VM page for the returned list.
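
Concretely, pre-filling your own list could look like this (an untested
sketch; GC_malloc_many() and the GC_NEXT() macro come from gc.h, the rest
is invented):

#include <gc.h>

static void *my_freelist = NULL;

/* Take a batch of 32-byte objects in one call and push them onto our
 * own free list. GC_malloc_many returns objects linked through their
 * first word; GC_NEXT() reads and writes that link. */
static void refill_from_gc(void) {
    void *batch = GC_malloc_many(32);
    while (batch != NULL) {
        void *obj = batch;
        batch = GC_NEXT(obj);        /* unlink from the returned batch */
        GC_NEXT(obj) = my_freelist;  /* push onto our own list */
        my_freelist = obj;
    }
}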


Post by Paul Bone
Can you or someone else on the BDWGC list tell me in what order blocks are
put on the free lists? Do the free spots in mostly-full blocks appear
earlier on the free lists? If things are placed on the free list in the
order they are swept, which blocks are swept sooner?
First, the free list for objects of a particular size (and kind) always[1]
contains only objects from a single gc/VM page. The objects are ordered by
memory address within the gc page.

When the free list for objects of a particular size is empty, a single page
of objects of the same size is swept, and any objects that were not marked
(i.e. not found reachable) at the previous GC are added to the free list.
Some objects in the page may have since become unreachable and this is not
detected, but it's impossible for a previously unreachable object to become
referenced -- since it was unreachable by the app :-)

If a page is "nearly full" then it is skipped, on the assumption that it
takes a lot of time for little benefit (speed/space tradeoff).

The order in which pages are swept is determined by the ok_reclaim_list for
blocks with that size of object, linked together by the hb_next field. This
is set up by GC_apply_to_all_blocks() after each GC, which processes pages
in order of memory address.

[1] unless you use GC_free(), in which case anything freed by GC_free() is
at the start of the free list and will be the first object(s) of that size
(and kind) to be returned from GC_malloc().
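
To illustrate [1] (assuming no collection happens in between, per the
behaviour just described):

    void *p = GC_malloc(32);
    GC_free(p);
    void *q = GC_malloc(32);  /* q == p: the explicitly freed object
                                 is handed out again first */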
Paul Bone
2016-10-09 21:57:18 UTC
Post by Bruce Hoult
Post by Paul Bone
I'd like to check that I understand. By pre-allocating long-lived items,
but not short-lived items, we're more likely to guarantee that those objects
are allocated in the same blocks, and not interspersed among other blocks
where they would prevent those blocks from being freed when the short-lived
objects are collected?
Right.
void* GC_malloc_many(size_t objSz)
... which is implemented by ...
void GC_generic_malloc_many(size_t objSz, int objKind, void **result)
It might make sense to have GC_malloc_many_atomic() or the uncollectable
variants, but they don't exist at the moment and you can get the effect now
by a direct call to GC_generic_malloc_many.
This call returns some number of objects of the same size (rounded up
minimally -- just to the next multiple of 16 bytes), all of which are
located in the same gc/VM page. The first word of each object points to the
next object, and the return value is a pointer to the first object.
The function tries to find an existing block of objects of that size,
allocating an entire new block only if one can't be found. It might make
sense to add a parameter (or another function) to force always using an
entire (new or reclaimed) VM page for the returned list.
Cheers.
Post by Bruce Hoult
Post by Paul Bone
Can you or someone else on the BDWGC list tell me in what order blocks are
put on the free lists? Do the free spots in mostly-full blocks appear
earlier on the free lists? If things are placed on the free list in the
order they are swept, which blocks are swept sooner?
First, the free list for objects of a particular size (and kind) always[1]
contains only objects from a single gc/VM page. The objects are ordered by
memory address within the gc page.
When the free list for objects of a particular size is empty, a single page
of objects of the same size is swept, and any objects that were not marked
(i.e. not found reachable) at the previous GC are added to the free list.
Some objects in the page may have since become unreachable and this is not
detected, but it's impossible for a previously unreachable object to become
referenced -- since it was unreachable by the app :-)
I have no problems understanding that last point ;-) If I doubted it I'd
wonder how I could cope with memory management at all! Actually, since
BoehmGC is conservative, maybe you're asserting an extra assurance: that in
this situation BoehmGC won't misinterpret a value that merely looks like a
pointer into previously used but not yet swept memory as a real pointer.
This is an edge case that I hadn't thought of before, but now seems pretty
obvious.
Post by Bruce Hoult
If a page is "nearly full" then it is skipped, on the assumption that it
takes a lot of time for little benefit (speed/space tradeoff).
That's a reasonable assumption. Has anyone tested different policies here?
As I understand it, sweeping should be relatively fast, especially for
mostly-full pages. It starts by checking the mark bits (a medium-sized but
sequential memory read); any unmarked objects it finds are threaded together
into a linked list (writes to a number of different cache lines).

SweepCost = ReadMarkBits + N*WriteLinks
SweepBenefit = N

Sweeping a mostly-full page is the cheapest sweep, but has little benefit.
Sweeping a mostly-empty page has a higher cost (but the writes are amortized
if there is more than one per cache line) and a great benefit. It seems that
skipping mostly-full pages mostly reduces how frequently sweeping is
required, and only affects the cost/benefit of sweeping slightly. Please
tell me if you disagree; this is about as deeply as I've actually worked
with memory management.
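
To put rough numbers on it (illustrative only; assume 4 KB blocks of
32-byte objects, i.e. 128 objects per block and 128 mark bits = 16 bytes):
sweeping a block with only 8 free slots reads those 16 bytes and writes 8
links, yielding 8 objects, while sweeping a block with 120 free slots reads
the same 16 bytes and writes 120 links, yielding 120 objects. Under this
simple model the cost is a small fixed read plus roughly one link write per
reclaimed object either way.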
Post by Bruce Hoult
The order in which pages are swept is determined by the ok_reclaim_list for
blocks with that size of object, linked together by the hb_next field. This
is set up by GC_apply_to_all_blocks() after each GC, which processes pages
in order of memory address.
[1] unless you use GC_free(), in which case anything freed by GC_free() is
at the start of the free list and will be the first object(s) of that size
(and kind) to be returned from GC_malloc().
Cheers.
--
Paul Bone
http://paul.bone.id.au
Bruce Hoult
2016-10-09 22:41:41 UTC
Post by Paul Bone
Post by Bruce Hoult
If a page is "nearly full" then it is skipped, on the assumption that it
takes a lot of time for little benefit (speed/space tradeoff).
That's a reasonable assumption. Has anyone tested different policies here?
As I understand it, sweeping should be relatively fast, especially for
mostly-full pages. It starts by checking the mark bits (a medium-sized but
sequential memory read); any unmarked objects it finds are threaded together
into a linked list (writes to a number of different cache lines).
SweepCost = ReadMarkBits + N*WriteLinks
SweepBenefit = N
Sweeping a mostly-full page is the cheapest sweep, but has little benefit.
Sweeping a mostly-empty page has a higher cost (but the writes are amortized
if there is more than one per cache line) and a great benefit. It seems that
skipping mostly-full pages mostly reduces how frequently sweeping is
required, and only affects the cost/benefit of sweeping slightly. Please
tell me if you disagree; this is about as deeply as I've actually worked
with memory management.
It looks as if this feature was added in version 5.0, and then the
implementation was changed to the current one in version 6.0.

Those releases were quite some time ago (I couldn't immediately find a dated
release list, but I think 5.0 was around 2000?), and computers have changed
a lot, so the same assumptions might not apply now. It might be worth
someone revisiting it.
Paul Bone
2016-10-09 23:40:13 UTC
Post by Paul Bone
I investigated an issue with memory fragmentation in BDW GC and Mercury and
it proved to be quite interesting, so I wrote it up as a blog post.
I received some feedback via twitter regarding the Memory Pool System (MPS),
another GC for uncooperative environments.

https://twitter.com/glaebhoerl/status/784737682706538496

IIRC we've tried this in the past and found that it didn't perform as well
as BDWGC. But I wasn't involved with Mercury at the time and things may
have changed. @glaebhoerl provided this link
https://gitlab.com/embeddable-common-lisp/ecl/issues/126 suggesting that it
might be worth a look as it may perform better. I know very little about
MPS so there's not much I can say about this. But I said I'd pass the
information on to the other Mercury devs.

Cheers.
--
Paul Bone
http://paul.bone.id.au
Zoltan Somogyi
2016-10-10 00:06:53 UTC
Post by Paul Bone
I received some feedback via twitter regarding the Memory Pool System (MPS),
another GC for uncooperative environments.
https://twitter.com/glaebhoerl/status/784737682706538496
IIRC we've tried this in the past and found that it didn't perform as well
as BDWGC. But I wasn't involved with Mercury at the time and things may
https://gitlab.com/embeddable-common-lisp/ecl/issues/126 suggesting that it
might be worth a look as it may perform better. I know very little about
MPS so there's not much I can say about this. But I said I'd pass the
information on to the other Mercury devs.
When we last tried MPS, MPS wasn't just slower than BDW. Reasonably often,
in maybe 10-20% of cases, it was *much* slower, as in go-get-a-cup-of-coffee
slower. This was despite the fact that our reason for looking into it then was that
it was said to be faster than BDW at the time.

Even if it has improved in the meantime, I am pretty sure there is no point
in looking into it any further, for two reasons. One, I am skeptical that they
could improve things dramatically enough to catch up to BDW, at least for
usage scenarios like ours. Two, with today's memory sizes, gc is, in most
cases, a relatively small fraction of execution time. Therefore even improvements
in the 10-20% range in gc time would translate to only very minor improvements
in overall execution time. The investment required just isn't worth it.

Zoltan.
Paul Bone
2016-10-10 00:35:31 UTC
Post by Zoltan Somogyi
Post by Paul Bone
I received some feedback via twitter regarding the Memory Pool System (MPS),
another GC for uncooperative environments.
https://twitter.com/glaebhoerl/status/784737682706538496
IIRC we've tried this in the past and found that it didn't perform as well
as BDWGC. But I wasn't involved with Mercury at the time and things may
https://gitlab.com/embeddable-common-lisp/ecl/issues/126 suggesting that it
might be worth a look as it may perform better. I know very little about
MPS so there's not much I can say about this. But I said I'd pass the
information on to the other Mercury devs.
When we last tried MPS, MPS wasn't just slower than BDW. Reasonably often,
in maybe 10-20% of cases, it was *much* slower, as in go-get-a-cup-of-coffee
slower. This was despite the fact that our reason for looking into it then was that
it was said to be faster than BDW at the time.
Even if it has improved in the meantime, I am pretty sure there is no point
in looking into it any further, for two reasons. One, I am skeptical that they
could improve things dramatically enough to catch up to BDW, at least for
usage scenarios like ours. Two, with today's memory sizes, gc is, in most
cases, a relatively small fraction of execution time. Therefore even improvements
in the 10-20% range in gc time would translate to only very minor improvements
in overall execution time. The investment required just isn't worth it.
When I was last benchmarking the impact of GC (2011) I found that it
accounted for a lot of the program's runtime. These figures are from
section 3.1 of my thesis, for the icfp2000 benchmark, which probably has a
slightly higher than average allocation rate.

1 Mutator thread and 1 GC thread:
Mutator 20.9s 55.2%
GC 16.9s 44.8%
Total 37.8s
There were 384 full heap collections.

4 Mutator threads and 4 GC threads:
Mutator 6.4s 47.6%
GC 7.1s 52.4%
Total 13.5s
There were 288 full heap collections.

When parallelising either the mutator or the collector, the ratio is closer
to 7:3. You're right that Amdahl's law applies here, but GC is quite
significant for us.

To collect data like this you can set GC_PRINT_STATS=1 in your environment
and add together the collection times (or multiply the average by the
number of collections) and compare this to the total time.
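
For example (the collector prints these statistics to stderr):

    GC_PRINT_STATS=1 ./my_program 2> gc-stats.log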

Nevertheless I agree that trying MPS is probably not going to be a great
use of our time. I don't know of any specific reason why it would be faster
than BDW GC, or why that would have changed since it was last measured.

Cheers.
--
Paul Bone
http://paul.bone.id.au
Bruce Hoult
2016-10-10 00:07:15 UTC
Post by Paul Bone
Post by Paul Bone
I investigated an issue with memory fragmentation in BDW GC and Mercury and
it proved to be quite interesting, so I wrote it up as a blog post.
I received some feedback via twitter regarding the Memory Pool System (MPS),
another GC for uncooperative environments.
https://twitter.com/glaebhoerl/status/784737682706538496
IIRC we've tried this in the past and found that it didn't perform as well
as BDWGC. But I wasn't involved with Mercury at the time and things may
https://gitlab.com/embeddable-common-lisp/ecl/issues/126 suggesting that it
might be worth a look as it may perform better. I know very little about
MPS so there's not much I can say about this. But I said I'd pass the
information on to the other Mercury devs.
The Dylan programming language has had several implementations, using a
variety of GCs:

Gwydion Dylan, originally at CMU on HP/UX, and subsequently on probably
every popular Unix-like system, has always used BDWGC.

Open Dylan, nee Functional Developer, nee Harlequin Dylan, was developed at
Harlequin, who also developed the Memory Pool System. Open Dylan started on
Windows as a native compiler, and has been ported to 32-bit x86 Linux, also
as a native compiler. These versions use MPS. It has also been ported to
other platforms/CPUs and to 64-bit using a C back end; this version uses
BDWGC.

MPS was said to have 30 man-years of development work in it at the point
that Harlequin spun it off. The overall Dylan environment was over 100
man-years. (MPS was also used in Harlequin's ML and Lisp products, and
maybe others.)

There is no doubt that MPS is very flexible, and can use a variety of
different allocation and collection techniques, both precise and
conservative.

I don't recall MPS benchmarking any better than BDWGC with Open Dylan.
Maybe the memory use is lower, but I'm not sure about that. There's
probably some data in the Dylan newsgroup or mailing list somewhere, but
it's a few years now since I've been actively involved.
Sebastian Godelet
2016-10-10 11:29:31 UTC
Hi Paul
Post by Paul Bone
Please let me know if I've gotten anything wrong, or of course if you have
any other comments.
With Mercury we have the advantage of easily testing the impact of a
mark-and-sweep collector like BDWGC against a generational garbage collector
such as SGen from Mono.
I'm not sure though whether your benchmarking programs work with a
high-level backend such as the C# backend. If they do, you could just select
the collector in the Mono runtime via "mono --gc=boehm" or "mono --gc=sgen",
as shown below.
I think that would lead to a valid comparative analysis, though
unfortunately not for a C backend.
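
For example (assuming the benchmark builds in a C# grade and produces
prog.exe; --gc selects Mono's collector):

    mono --gc=boehm prog.exe
    mono --gc=sgen prog.exe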

As Java and C# have both switched to a generational GC, which in both cases
is implemented as a moving collector, I think this should be explored. On
the other hand, the paper "Compile time garbage collection for Mercury"
(https://mercurylang.org/documentation/papers/CW2004_03_mazur.pdf) or, more
recently, the approach the Rust language is taking are interesting ideas to
explore, although I think creating a Rust backend would be a rather
challenging task.

Cheers,

Sebastian
Paul Bone
2016-12-30 00:19:08 UTC
Post by Paul Bone
This is a cross post as both communities will probably be interested.
I investigated an issue with memory fragmentation in BDW GC and Mercury and
it proved to be quite interesting, so I wrote it up as a blog post.
http://paul.bone.id.au/2016/10/08/memory-fragmentation-in-boehmgc/
I also learnt a fair amount about the collector in the process, and
explained that in the article too. It may be interesting if you're curious
about how allocation works in the collector.
After writing this article back in October I received some feedback from
Bruce Hoult on the BDW GC mailing list. Based on his feedback I've written
a follow-up article:
http://paul.bone.id.au/2016/12/29/more-about-memory-fragmentation/
I hope it will also be interesting.

As before, please let me know if I've gotten anything wrong, or of course if
you have any other comments.

I've put my changes to BDW GC in a pull request against master, and I have
the same changes in a branch on my own GitHub for the Mercury version of
BDW GC. I'll back-port any feedback onto the Mercury version before I
commit it.

Have a happy new year.
--
Paul Bone
http://paul.bone.id.au