Discussion:
[m-dev.] deleting tags_high
Zoltan Somogyi
2018-06-14 13:35:31 UTC
Permalink
Currently we support three tag_methods:

- storing primary tags in the least significant 2 or 3 bits of a word,
(tags_low)

- storing primary tags in the most significant N bits of a word, where
N may be more than 3 (tags_high)

- not storing primary tags bits at all, distinguishing functors using
tags stored in memory (tags_none).

I propose that we delete support for tags_high. We started with that
because we knew it worked, since it is the method that Prolog systems
have traditionally used. However, once we got tags_low working,
I don't think we ever used tags_high in anger more than once,
and that one occasion was for benchmarks to show that tags_low
was faster :-(

The motivation for raising this issue now this is that I don't want to
write code that adds mktag and unmktag unary operations to the
LLDS and MLDS code being generated, when I know that both are no-ops
unless someone selects tags_high. With my proposal, we should be able
to delete the mktag and unmktag unops; they should *never* be needed.

Any objections?

Zoltan.
Peter Wang
2018-06-14 14:33:03 UTC
Permalink
Post by Zoltan Somogyi
I propose that we delete support for tags_high.
No objection from me.

Peter
Julien Fischer
2018-06-14 17:47:29 UTC
Permalink
Post by Zoltan Somogyi
I propose that we delete support for tags_high. We started with that
because we knew it worked, since it is the method that Prolog systems
have traditionally used. However, once we got tags_low working,
I don't think we ever used tags_high in anger more than once,
and that one occasion was for benchmarks to show that tags_low
was faster :-(
The motivation for raising this issue now this is that I don't want to
write code that adds mktag and unmktag unary operations to the
LLDS and MLDS code being generated, when I know that both are no-ops
unless someone selects tags_high. With my proposal, we should be able
to delete the mktag and unmktag unops; they should *never* be needed.
Any objections?
No, go ahead.

Julien.
Zoltan Somogyi
2018-06-14 18:22:08 UTC
Permalink
Post by Julien Fischer
No, go ahead.
Done. The diff is simple enough not to need review.

Zoltan.
Paul Bone
2018-07-27 00:54:47 UTC
Permalink
Post by Zoltan Somogyi
- storing primary tags in the least significant 2 or 3 bits of a word,
(tags_low)
- storing primary tags in the most significant N bits of a word, where
N may be more than 3 (tags_high)
- not storing primary tags bits at all, distinguishing functors using
tags stored in memory (tags_none).
I propose that we delete support for tags_high. We started with that
because we knew it worked, since it is the method that Prolog systems
have traditionally used. However, once we got tags_low working,
I don't think we ever used tags_high in anger more than once,
and that one occasion was for benchmarks to show that tags_low
was faster :-(
It surprises me that tags_low is faster. Hrm, although the GC would need to
handle this the way it handles low bits, which I don't know but it might
treat them as interior pointers. Oh, one problem could be that more bit
patterns could be considered pointers, previously any bit pattern with one
of these bits set would be definitely not a pointer. So using high tag bits
could retain more memory.

One benefit of tags_high is that on x86_64 (at least at the moment) there
are 16 bits available at the high end of a pointer.

https://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details

When I've mentioned this in the past the concern was that it might change in
the future, or we may have to support something different on other
architectures.

But I learnt something new recently. Because SpiderMonkey (the JavaScript
engine in Firefox) uses NaN boxing to represent values. That is numbers in
JS are doubles, so objects are encoded by using the lower 48 bits of some
encodings of NaN, and other NaNs generated by arithmetic expressions are
normalised. Therefore all pointers to objects need to have their high 17
bits clear. But we basically get that for free because of the address space
behaviour on x86_64. It can be a problem on other architectures though, and
when it is mmap can be given a hint about where in the virtual address space
to allocate pages:

https://searchfox.org/mozilla-central/source/js/src/gc/Memory.cpp#554

So if using 16 or maybe 17 bits as high tag bits sounds interesting, this is
an option. I don't know if high tag bits will be better or worse, but I do
think it is interesting.

Also remember that BoehmGC's minimum allocation and alignment is two machine
words. So you could use 4 or 3 low tag bits.
--
Paul Bone
http://paul.bone.id.au
Zoltan Somogyi
2018-07-27 01:50:04 UTC
Permalink
Post by Paul Bone
It surprises me that tags_low is faster.
It shouldn't. With low tags, given a tagged pointer, subtracting the known
primary tag (e.g. -1) and adding the offset of the desired field (e.g. +8)
to the value of the base pointer can be done together (by adding -1+8 = +7
to the value of the tagged pointer). Load and store instructions in most
instruction sets include the ability to specify such small offsets, so the arithmetic
is not incurring any cost over the cost of the load or store itself. With high tags,
the combined offsets cannot possibly fit into the load or store instruction
(they would need 64 bits, whereas load/store instructions typically have room
for 8 or 16 bits). The masking off of the high tag will typically take two or three
instructions (load the tag value constant in the low order bits of a register,
shift it up to the required bit positions, then mask), and *then* do the load
or store. Since the address to access depends on the result of the mask op,
there is no room for instruction level parallelism either. This is why tags high
is not just slower, but significantly slower than tags low. It also has more instructions,
making the I-cache less effective.
Post by Paul Bone
Hrm, although the GC would need to
handle this the way it handles low bits, which I don't know but it might
treat them as interior pointers.
Yes, we register 0-3 (on 32 bit machines) and 0-7 (on 64 machines)
as possible offsets for boehm gc, so any value at any such offset to
the start of an allocated block counts as a pointer to the block.
Post by Paul Bone
One benefit of tags_high is that on x86_64 (at least at the moment) there
are 16 bits available at the high end of a pointer.
https://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details
Look up reports of the transition from the IBM S/360 to the S/370 in the 1970s.
Both used 32 bits words, but addresses were only 24 bits on the S/360.
People who stashed other info in the top 8 bits of words containing pointers
regretted it when the S/370 came along, which extended pointer size to 32 bits
(though IBM literature calls them 31 bit, since the top bit had another use).

The rule "In addition, the AMD specification requires that the most significant
16 bits of any virtual address, bits 48 through 63, must be copies of bit 47"
on the page you link to was specified by AMD specifically to avoid any similar
problems. I know because one of the architects involved said so on comp.arch
at the time.

However, just because there are 16 bits available NOW does not mean that
those same 16 bits will ALWAYS be available. The page you link to even has
diagrams illustrating this fact.
Post by Paul Bone
So if using 16 or maybe 17 bits as high tag bits sounds interesting, this is
an option. I don't know if high tag bits will be better or worse, but I do
think it is interesting.
For the reason I gave above, it will be worse.
Post by Paul Bone
Also remember that BoehmGC's minimum allocation and alignment is two machine
words. So you could use 4 or 3 low tag bits.
Not all memory cells containing Mercury terms are in memory allocated by boehm;
some are in statically generated data structures. We would need to ensure that
these are aligned on 16 byte boundaries as well. Does anyone know how portable
gcc's __attribute__((aligned(16))) is?

(It would be nice to have someone on the C standardization committee again :-)

Zoltan.
Paul Bone
2018-07-27 02:14:16 UTC
Permalink
Post by Zoltan Somogyi
Post by Paul Bone
It surprises me that tags_low is faster.
It shouldn't. With low tags, given a tagged pointer, subtracting the known
primary tag (e.g. -1) and adding the offset of the desired field (e.g. +8)
to the value of the base pointer can be done together (by adding -1+8 = +7
to the value of the tagged pointer). Load and store instructions in most
instruction sets include the ability to specify such small offsets, so the arithmetic
is not incurring any cost over the cost of the load or store itself. With high tags,
the combined offsets cannot possibly fit into the load or store instruction
(they would need 64 bits, whereas load/store instructions typically have room
for 8 or 16 bits). The masking off of the high tag will typically take two or three
instructions (load the tag value constant in the low order bits of a register,
shift it up to the required bit positions, then mask), and *then* do the load
or store. Since the address to access depends on the result of the mask op,
there is no room for instruction level parallelism either. This is why tags high
is not just slower, but significantly slower than tags low. It also has more instructions,
making the I-cache less effective.
Yeah, that makes sense, plus the use of the extra register (small compared
to the other operations). This is also going to be slow when you're
following pointers from one struct to another, serialising them all with the
memory reads.
Post by Zoltan Somogyi
Post by Paul Bone
Hrm, although the GC would need to
handle this the way it handles low bits, which I don't know but it might
treat them as interior pointers.
Yes, we register 0-3 (on 32 bit machines) and 0-7 (on 64 machines)
as possible offsets for boehm gc, so any value at any such offset to
the start of an allocated block counts as a pointer to the block.
Post by Paul Bone
One benefit of tags_high is that on x86_64 (at least at the moment) there
are 16 bits available at the high end of a pointer.
https://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details
Look up reports of the transition from the IBM S/360 to the S/370 in the 1970s.
Both used 32 bits words, but addresses were only 24 bits on the S/360.
People who stashed other info in the top 8 bits of words containing pointers
regretted it when the S/370 came along, which extended pointer size to 32 bits
(though IBM literature calls them 31 bit, since the top bit had another use).
Yes, in that link I provided you can also see where Mozilla hit this problem
on some archiectures. In the context of JS it does make sense so that NaN
boxing is possible.
Post by Zoltan Somogyi
The rule "In addition, the AMD specification requires that the most significant
16 bits of any virtual address, bits 48 through 63, must be copies of bit 47"
on the page you link to was specified by AMD specifically to avoid any similar
problems. I know because one of the architects involved said so on comp.arch
at the time.
:-) Yes, you'd need to clear those bits (0 because it's in user space)
before dereferencing the pointer, and you'd have the problem you mentioned
above, with the load immediate, shift and mask.
Post by Zoltan Somogyi
However, just because there are 16 bits available NOW does not mean that
those same 16 bits will ALWAYS be available. The page you link to even has
diagrams illustrating this fact.
Yes, I acknowledged that in my first e-mail, that was something you said
when I raised this previously. But I raised it again because I had
forgotten that the load+shift+mask cost more than the free offset
calculation.
Post by Zoltan Somogyi
Post by Paul Bone
So if using 16 or maybe 17 bits as high tag bits sounds interesting, this is
Also remember that BoehmGC's minimum allocation and alignment is two machine
words. So you could use 4 or 3 low tag bits.
Not all memory cells containing Mercury terms are in memory allocated by boehm;
some are in statically generated data structures. We would need to ensure that
these are aligned on 16 byte boundaries as well. Does anyone know how portable
gcc's __attribute__((aligned(16))) is?
Good point.
Post by Zoltan Somogyi
(It would be nice to have someone on the C standardization committee again :-)
:-)
--
Paul Bone
http://paul.bone.id.au
Loading...