[m-dev.] for discussion: design issue for new integer types

Post by Julien Fischer
Just a reminder: please ensure you have updated to rotd-2016-10-26
by Monday; I will be committing part 2 of the uint change then.

I have updated.

Post by Julien Fischer
In order to not overload the current type checker, literals for each integer
type will need to be lexically distinct. My suggestion is that each integer
type have a distinguishing suffix. For the fixed size integer types these would be:
i8, i16, i32, i64
u8, u16, u32, u64

I think those are fine.

Post by Julien Fischer
i, u
iw, uw (where w == "word sized")

I prefer i and u.

Post by Julien Fischer
The suffix would not be required for literals of type 'int'.

I don't think anyone would argue against that.

Post by Julien Fischer
Should the suffixes use only lowercase 'u' and 'i' or would uppercase
also be acceptable? Or uppercase only?

I would strongly argue against allowing I as a suffix. I am probably
the only one on this list who is old enough to have used a typewriter
whose numbers did not include 0 and 1: you were supposed to type
'o' instead of '0', and 'l' instead of '1'. This works for human readers
precisely because an 'l' is easily visually confusable with '1'. For them,
this fact allowed them to save the cost of a key. For us, the fact that
an uppercase i is also visually confusable with 1 would be
a fruitful source of confusing bugs, where people *think* they are
looking at 421, but are actually looking at 42 with an I suffix.

On the basis of symmetry, this would argue against U as well.

Post by Julien Fischer
As an aside: it's long since time we allowed some form of separator
between groups of digits in integer (and float) literals. I propose
that we allow '_' between digits as in Java and C#.

I agree that is a good idea.

A followup question: should we require that the _s be where western
convention dictates the decimal commas should go, i.e. between
every third digit? I for one would prefer that, but people using the
indian number system, which puts commas around groups of *two*
digits above the thousands, would probably prefer that there
not be such a rule (look up "lakh" or "crore" on wikipedia).

We would need to delete the _s at some point anyway. If we do it
in the compiler, we can make the code doing the deletion
generate a warning if the _s are in the "wrong" place, with the
notion of "wrong" being selected by compiler options such as
--warn-misplaced-integer-underscores-{western,indian}.
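As a rough illustration of what such an optional warning could test, here is a minimal sketch in Python (not Mercury, and not the compiler's actual code). The function name and the "western"/"indian" style strings are hypothetical, and it only considers decimal literals that have already had any suffix removed.

```python
def misplaced_underscores(literal: str, style: str = "western") -> bool:
    """Return True if the underscores in a decimal literal break the grouping.

    "western" expects groups of three digits throughout; "indian" expects a
    final group of three digits preceded by groups of two.
    """
    inner = {"western": 3, "indian": 2}[style]
    groups = literal.split("_")
    if len(groups) == 1:
        return False  # no underscores, so nothing to warn about
    if any(not (g.isascii() and g.isdigit()) for g in groups):
        return True   # leading, trailing, or doubled underscore
    first, middle, last = groups[0], groups[1:-1], groups[-1]
    if len(last) != 3:
        return True   # both conventions end with a group of three
    if any(len(g) != inner for g in middle):
        return True
    # The leading group may be shorter than a full group, but not longer.
    return len(first) > inner


print(misplaced_underscores("1_000_000"))             # western grouping
print(misplaced_underscores("10_00_000", "indian"))   # ten lakh
print(misplaced_underscores("10_00_000", "western"))  # flagged under western rules
```

Because the check is a pure function of the literal's text, it only works if the underscores are still present when it runs, which is the reason for keeping the literal as a string until the compiler sees it.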

When the compiler finds e.g. an i8 suffix on any integer outside
the -128 to 127 range, we want to generate an error message anyway.
The _ check could be done at the same time.

I don't think the *scanner* should do such checks, because
it would result in suboptimal error messages. The compiler
has access to error_util.m; the library does not.
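A minimal sketch of the range check, again in Python purely for illustration. The suffix table follows the i8..u64 suffixes proposed above; the regular expression, the function name, and the treatment of a leading minus sign as part of the literal are assumptions of this sketch, not a description of the Mercury lexer. The word-sized i and u suffixes are left out because their range depends on the target.

```python
import re

# Suffix -> (signed?, bits), following the proposed fixed size suffixes.
SUFFIXES = {
    "i8": (True, 8), "i16": (True, 16), "i32": (True, 32), "i64": (True, 64),
    "u8": (False, 8), "u16": (False, 16), "u32": (False, 32), "u64": (False, 64),
}

# Optional sign, digits with single underscores between them, then a suffix.
LITERAL = re.compile(r"(-?)([0-9]+(?:_[0-9]+)*)([iu](?:8|16|32|64))")


def check_literal(text: str) -> int:
    """Return the value of a suffixed literal, or raise ValueError."""
    m = LITERAL.fullmatch(text)
    if m is None:
        raise ValueError(f"{text}: not a suffixed integer literal")
    sign, digits, suffix = m.groups()
    value = int(sign + digits.replace("_", ""))  # delete the _s, then convert
    signed, bits = SUFFIXES[suffix]
    lo = -(1 << (bits - 1)) if signed else 0
    hi = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{text}: {value} is outside {lo}..{hi} for {suffix}")
    return value


print(check_literal("42i8"))
print(check_literal("1_000i16"))
try:
    check_literal("128i8")
except ValueError as err:
    print(err)
```

Doing this after scanning, rather than inside the scanner, means the error can be reported with whatever context the caller has available.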

Post by Julien Fischer
2. Automatic coercion and promotion.
There won't be any in Mercury. If you are converting between integer
types then you will be required to say so.

Agreed.

What form would those explicit coercions take? Would we have
a specific function for each pair of integer types? How about
e.g. i16 to float: would you have to convert the i16 to int first?

Post by Julien Fischer
3. Representation of new integer types in the term type.
How should the new integer types be represented in the term.const/0
type?
:- type const
---> atom(string)
; integer(int)
; big_integer(integer_base, integer)
% An integer that is too big for `int'.
; unsigned_integer(uint)
; big_unsigned_integer(integer_base, integer)
% An unsigned integer that is too big for `uint'.
; string(string)
; float(float)
; implementation_defined(string)
; uint8(uint8)
; uint16(uint16)
; uint32(uint32)
; uint64(uint64)
; int8(int8)
; int16(int16)
; int32(int32)
; int64(int64).

I would instead suggest that we keep just the existing
integer and big_integer functors, and add a new argument to both.
This argument would say int vs uint, and 8 vs 16 vs 32 vs 64 vs
default size, *purely on the basis of the suffix, without any check
in the scanner*, for the reason given above.

To allow the underscore check mentioned above, the existing argument
of the integer and big_integer functors would need to be a string,
with the conversion done in the compiler. However, doing that
would erase the need for the big_integer functor, since the integer
functor would then be able to represent everything it can.

Two other things. First, some people may be using the library's
lexer and parser modules for their own purposes (e.g. Prolog interpreters),
so if we change their basic representation, we should add their old
versions to e.g. extras under names such as old_{lexer,parser}.m.
Second, I have a big outstanding change to fact_table.m that would
be affected by a change to the term type, so please warn me before
committing such a change.

A question you did not ask was how the representation of integers
should change in the HLDS, i.e. in the cons_id type. I think I would
prefer adding a size argument to the int_const and uint_const
functors to adding new int8_const, int16_const, etc. functors
to the type, because most code would want to treat all integers
the same regardless of size. I would even prefer to erase the
distinction between int_const and uint_const, but realize that
this cannot be done, because in the HLDS, we definitely want
the constant in integer, not string, form, and there is no word
sized type that can hold both all ints and all uints. However,
we could switch to int_const(integer, signedness, maybe(size)).

The checks I mentioned above (does e.g. an i8 fit in -128 to 127,
are the _s in the right place) would naturally fit in the code
(in superhomogeneous.m, I think) that converts from term consts
to cons_ids.
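To make the two representations concrete, here is a small Python model of them. It is only a sketch: the class and field names are hypothetical stand-ins for a term constant that keeps the literal as a string and for int_const(integer, signedness, maybe(size)), with Python's arbitrary-precision int playing the role of the integer type and None playing the role of "no size given".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TermInteger:
    """Parser-level constant: the text as written, plus what the suffix said."""
    text: str            # digits exactly as written, underscores included
    signed: bool
    size: Optional[int]  # None means the default (word) size


@dataclass(frozen=True)
class IntConst:
    """HLDS-level constant, modelled on int_const(integer, signedness, maybe(size))."""
    value: int           # arbitrary precision, so no separate big_integer case
    signed: bool
    size: Optional[int]


def to_int_const(term: TermInteger) -> IntConst:
    """Convert a term constant to an HLDS-style constant.

    This is the point where range and underscore checks would be applied;
    this sketch only deletes the underscores and converts the digits.
    """
    value = int(term.text.replace("_", ""))
    return IntConst(value, term.signed, term.size)


# The largest u64 value does not fit in a 64-bit signed int, yet needs no
# special functor when the term keeps the literal as a string.
print(to_int_const(TermInteger("18_446_744_073_709_551_615", False, 64)))
```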

Post by Julien Fischer
4. poly_type and format.
5. Reverse modes of arithmetic operations.

I will comment on these later.

Zoltan.

Sebastian Godelet

2016-10-28 03:25:31 UTC

Hi Zoltan,

Behalf Of Zoltan Somogyi
Sent: Friday, October 28, 2016 11:06
Subject: Re: [m-dev.] for discussion: design issue for new integer types
On Fri, 28 Oct 2016 11:51:51 +1100 (AEDT), Julien Fischer

Just a reminder: please ensure you have updated to rotd-2016-10-26 by
Monday; I will be committing part 2 of the uint change then.

I have updated.

In order to not overload the current type checker, literals for each
integer type will need to be lexically distinct. My suggestion is
that each integer type have a distinguishing suffix. For the fixed size integer types these would be:
i8, i16, i32, i64
u8, u16, u32, u64

I think those are fine.

i, u
iw, uw (where w == "word sized")

I prefer i and u.

The suffix would not be required for literals of type 'int'.

I don't think anyone would argue against that.

Should the suffixes use only lowercase 'u' and 'i' or would uppercase
also be acceptable? Or uppercase only?

I would strongly argue against allowing I as a suffix. I am probably the only
one on this list who is old enough to have used a typewriter whose numbers
did not include 0 and 1: you were supposed to type 'o' instead of '0', and 'l'
instead of '1'. This works for human readers precisely because an 'l' is easily
visually confusable with '1'. For them, this fact allowed them to save the cost
of a key. For us, the fact that an uppercase i is also visually confusable with 1
would be a fruitful source of confusing bugs, where people *think* they are
looking at 421, but are actually looking at 42 with an I suffix.
On the basis of symmetry, this would argue against U as well.

I know breaking symmetry is not always good but in this case I think that using only uppercase "L" and not allowing "uppercase i" or "lowercase L" would work better than not allowing any uppercase suffixes at all.

Additionally, maybe "i" shouldn't be used at all, since a) it is the default and b) it could be reserved for the eventual inclusion of complex number literals.

As an aside: it's long since time we allowed some form of separator
between groups of digits in integer (and float) literals. I propose
that we allow '_' between digits as in Java and C#.

I agree that is a good idea.

Yes +1 on this.
I'm not sure what others' views are on C++-style user-defined suffixes. They can be useful for certain types of code.

A followup question: should we require that the _s be where western
convention dictates the decimal commas should go, i.e. between every third
digit? I for one would prefer that, but people using the indian number
system, which puts commas around groups of *two* digits above the
thousands, would probably prefer that there not be such a rule (look up
"lakh" or "crore" on wikipedia).

The same goes for the Chinese number system, which might not group digits in threes, so this should be flexible as well.

We would need to delete the _s at some point anyway. If we do it in the
compiler, we can make the code doing the deletion generate a warning if
the _s are in the "wrong" place, with the notion of "wrong" being selected by
compiler options such as --warn-misplaced-integer-underscores-
{western,indian}.
When the compiler finds e.g. an i8 suffix on any integer outside the -128 to
127 range, we want to generate an error message anyway.
The _ check could be done at the same time.
I don't think the *scanner* should do such checks, because it would result in
suboptimal error messages. The compiler has access to error_util.m; the
library does not.

2. Automatic coercion and promotion.
There won't be any in Mercury. If you are converting between integer
types then you will be required to say so.

Agreed.
What form would those explicit coercions take? Would we have a specific
function for each pair of integer types? How about e.g. i16 to float: would
you have to convert the i16 to int first?

3. Representation of new integer types in the term type.
How should the new integer types be represented in the
term.const/0 type?
:- type const
---> atom(string)
; integer(int)
; big_integer(integer_base, integer)
% An integer that is too big for `int'.
; unsigned_integer(uint)
; big_unsigned_integer(integer_base, integer)
% An unsigned integer that is too big for `uint'.
; string(string)
; float(float)
; implementation_defined(string)
; uint8(uint8)
; uint16(uint16)
; uint32(uint32)
; uint64(uint64)
; int8(int8)
; int16(int16)
; int32(int32)
; int64(int64).

I would instead suggest that we keep just the existing integer and
big_integer functors, and add a new argument to both.
This argument would say int vs uint, and 8 vs 16 vs 32 vs 64 vs default size,
*purely on the basis of the suffix, without any check in the scanner*, for
the reason given above.
To allow the underscore check mentioned above, the existing argument of
the integer and big_integer functors would need to be a string, with the
conversion done in the compiler. However, doing that would erase the need
for the big_integer functor, since the integer functor would then be able to
represent everything it can.
Two other things. First, some people may be using the library's lexer and
parser modules for their own purposes (e.g. Prolog interpreters), so if we
change their basic representation, we should add their old versions to e.g.
extras under names such as old_{lexer,parser}.m.
Second, I have a big outstanding change to fact_table.m that would be
affected by a change to the term type, so please warn me before committing
such a change.
A question you did not ask was how the representation of integers should
change in the HLDS, i.e. in the cons_id type. I think I would prefer adding a
size argument to the int_const and uint_const functors to adding new
int8_const, int16_const, etc. functors to the type, because most code would
want to treat all integers the same regardless of size. I would even prefer to
erase the distinction between int_const and uint_const, but realize that this
cannot be done, because in the HLDS, we definitely want the constant in
integer, not string, form, and there is no word sized type that can hold both
all ints and all uints. However, we could switch to int_const(integer,
signedness, maybe(size)).
The checks I mentioned above (does e.g. an i8 fit in -128 to 127, are the _s in
the right place) would naturally fit in the code (in superhomogeneous.m, I
think) that converts from term consts to cons_ids.

4. poly_type and format.
5. Reverse modes of arithmetic operations.

I will comment on these later.
Zoltan.

Zoltan Somogyi

2016-10-28 03:47:45 UTC

Post by Sebastian Godelet
I know breaking symmetry is not always good but in this case I think that using only uppercase "L" and not allowing "uppercase i" or "lowercase L" would work better than not allowing any uppercase suffixes at all.

What are you proposing that a suffix L should stand for? Surely
not for "signed integer"?

I brought up lowercase L only because that was the old typewriter convention
that I was using to illustrate my point; the letter L is not part of the current
proposal in either lower or upper case.

Post by Sebastian Godelet
Additionally, maybe "i" shouldn't be used at all, since a) it is the default and b) it could be reserved for the eventual inclusion of complex number literals.

The proposal is that e.g. 42 as an 8 bit signed integer would be written as
42i8. Without the i, that is 428.

Complex number literals would need a more complicated syntax anyway,
since they include *two* numbers. They are also inherently floating point
numbers (I don't think I have even heard of anyone using the set of complex
*integers* for anything), so the presence of a decimal point in those numbers
would distinguish them anyway. (We currently use the presence/absence
of a decimal point to separate floats from ints.)

Zoltan.

Sebastian Godelet

2016-10-28 05:25:55 UTC

Hi,

What are you proposing that a suffix L should stand for? Surely not for "signed integer"?

Oh I think I misunderstood how the suffixes are going to work,
I thought there would be the usual suffixes found in many other programming languages, such as 1000L and 1.4f. But if the bitness is always included in the suffix, that is of course pointless.

Sorry,

Sebastian

Sebastian Godelet

2016-10-28 11:49:25 UTC

Hi Zoltan,

I know this might be out of scope but what about the arbitrary precision type "integer", would it be possible to have a literal syntax for that type as well? As a future proposal at least.

cheers,

Sebastian

Julien Fischer

2016-10-30 01:29:12 UTC

Hi Sebastian,

Post by Sebastian Godelet
I know this might be out of scope but what about the arbitrary
precision type "integer", would it be possible to have a literal
syntax for that type as well?

It would be possible, however I'm not convinced that it would be a
terribly useful thing. Once unsigned types are in the language (and
stable) we should revisit the implementation of arbitrary precision
integers since it will be possible to improve it quite a bit -- that's
a separate discussion though.

Julien.

Peter Wang

2016-10-28 04:29:38 UTC

Post by Julien Fischer
Just a reminder: please ensure you have updated to rotd-2016-10-26
by Monday; I will be committing part 2 of the uint change then.

I have updated.

I think those are fine.

Post by Julien Fischer
i, u
iw, uw (where w == "word sized")

I prefer i and u.

As do I.

Post by Julien Fischer
The suffix would not be required for literals of type 'int'.

I don't think anyone would argue against that.

Post by Julien Fischer
Should the suffixes use only lowercase 'u' and 'i' or would uppercase
also be acceptable? Or uppercase only?

I would strongly argue against allowing I as a suffix. I am probably
the only one on this list who is old enough to have used a typewriter
whose numbers did not include 0 and 1: you were supposed to type
'o' instead of '0', and 'l' instead of '1'. This works for human readers
precisely because an 'l' is easily visually confusable with '1'. For them,
this fact allowed them to save the cost of a key. For us, the fact that
an uppercase i is also visually confusable with 1 would be
a fruitful source of confusing bugs, where people *think* they are
looking at 421, but are actually looking at 42 with an I suffix.
On the basis of symmetry, this would argue against U as well.

I am fine with only lowercase.

I agree that is a good idea.
A followup question: should we require that the _s be where western
convention dictates the decimal commas should go, i.e. between
every third digit? I for one would prefer that, but people using the
indian number system, which puts commas around groups of *two*
digits above the thousands, would probably prefer that there
not be such a rule (look up "lakh" or "crore" on wikipedia).
We would need to delete the _s at some point anyway. If we do it
in the compiler, we can make the code doing the deletion
generate a warning if the _s are in the "wrong" place, with the
notion of "wrong" being selected by compiler options such as
--warn-misplaced-integer-underscores-{western,indian}.
When the compiler finds e.g. an i8 suffix on any integer outside
the -128 to 127 range, we want to generate an error message anyway.
The _ check could be done at the same time.
I don't think the *scanner* should do such checks, because
it would result in suboptimal error messages. The compiler
has access to error_util.m; the library does not.

Underscores will also help the readability of literals in other bases.
For hexadecimal you'd probably want to separate digits into groups of
4 or 8.

When working with some protocol or file format, it may be useful to
group digits according to how information is packed into certain bits
in that format.

You could have different rules depending on the base, or only check
decimal literals. However, in my experience long integer literals are
rare (say, over five digits long), and most of *those* are hexadecimal.
Therefore, I doubt the value of such a check for catching errors.

Peter

Julien Fischer

2016-10-28 12:59:24 UTC

Post by Peter Wang
Underscores will also help the readability of literals in other bases.
For hexadecimal you'd probably want to separate digits into groups of
4 or 8.
When working with some protocol or file format, it may be useful to
group digits according to how information is packed into certain bits
in that format.

In a word, this.

Post by Peter Wang
You could have different rules depending on the base, or only check
decimal literals.

You could; you won't.

Julien.

Julien Fischer

2016-10-30 01:24:04 UTC

Hi Zoltan,

I think those are fine.

Post by Julien Fischer
i, u
iw, uw (where w == "word sized")

I prefer i and u.

Ok. I've started writing up a change to the reference manual for all
this. I'll add the fixed size types as well and we can comment them out
until they're actually added.

Post by Julien Fischer
The suffix would not be required for literals of type 'int'.

I don't think anyone would argue against that.

I certainly hope not!

...

As Peter has mentioned elsewhere in this thread, there are *good* reasons
why their positioning should be left up to the programmer.

Post by Zoltan Somogyi
We would need to delete the _s at some point anyway. If we do it
in the compiler, we can make the code doing the deletion
generate a warning if the _s are in the "wrong" place, with the
notion of "wrong" being selected by compiler options such as
--warn-misplaced-integer-underscores-{western,indian}.

I don't think that's something the compiler should be concerned with
(except possibly in the formatting of error messages).

Post by Julien Fischer
2. Automatic coercion and promotion.
There won't be any in Mercury. If you are converting between integer
types then you will be required to say so.

Agreed.
What form would those explicit coercions take? Would we have a specific
function for each pair of integer types?

Yes. The existing numeric types (int, float, rational, integer) already
define these sorts of coercions; with the new types there's just going
to be a lot more of them.

Post by Zoltan Somogyi
How about e.g. i16 to float: would you have to convert the i16 to int first?

I think having the function int16.to_float (or float.from_int16) is reasonable
enough, there's no need to go via an int.

I prefer the second scheme.

Post by Zoltan Somogyi
Two other things. First, some people may be using the library's
lexer and parser modules for their own purposes (e.g. Prolog interpreters),
so if we change their basic representation, we should add their old
versions to e.g. extras under names such as old_{lexer,parser}.m.

Ok, I will add a copy of the existing modules to extras.

Post by Zoltan Somogyi
Second, I have a big outstanding change to fact_table.m that would
be affected by a change to the term type, so please warn me before
committing such a change.

Will do. Such a change is some way off yet in any case.

Post by Zoltan Somogyi
A question you did not ask was how the representation of integers
should change in the HLDS, i.e. in the cons_id type. I think I would
prefer adding a size argument to the int_const and uint_const
functors to adding new int8_const, int16_const, etc. functors
to the type, because most code would want to treat all integers
the same regardless of size.

Ok.

Post by Zoltan Somogyi
I would even prefer to erase the distinction between int_const and
uint_const, but realize that this cannot be done, because in the HLDS, we
definitely want the constant in integer, not string, form, and there is no
word sized type that can hold both all ints and all uints. However, we could
switch to int_const(integer, signedness, maybe(size)).

Ok.

Post by Zoltan Somogyi
The checks I mentioned above (does e.g. an i8 fit in -128 to 127,
are the _s in the right place) would naturally fit in the code
(in superhomogeneous.m, I think) that converts from term consts
to cons_ids.

Post by Julien Fischer
4. poly_type and format.
5. Reverse modes of arithmetic operations.

I will comment on these later.

Do you have a preference as to the type of the second operand of the
shift operations (point 6 in my original post)?

Julien.

Zoltan Somogyi

2016-10-30 05:33:39 UTC

Post by Zoltan Somogyi
A followup question: should we require that the _s be where western
convention dictates the decimal commas should go, i.e. between
every third digit? I for one would prefer that, but people using the
indian number system, which puts commas around groups of *two*
digits above the thousands, would probably prefer that there
not be such a rule (look up "lakh" or "crore" on wikipedia).

As Peter has mentioned elsewhere in this thread, there are *good* reasons
why their positioning should be left up to the programmer.

My use of “require” was the wrong word. As I thought my next paragraph
made clear, I was thinking only about a warning, and only if the
programmer explicitly *asked* for it.

I don't think that's something the compiler should be concerned with
(except possibly in the formatting of error messages).

What is the “that” the compiler shouldn’t be concerned with?
Generating warnings, or deleting the _s?

Deleting the _s has to be done *somewhere*, either the library
or the compiler.

Post by Julien Fischer
Yes. The existing numeric types (int, float, rational, integer) already
define these sorts of coercions; with the new types there's just going
to be a lot more of them.

With N integer types, there will need to be N*(N-1) coercion functions.
I guess N is just low enough for that to be manageable.

Do you propose to avoid a further multiplication by two by allowing
just one, not both, of e.g. int8_to_int32 and int32_from_int8?
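To put a number on that, here is a quick count in Python. It assumes N = 10, i.e. int, uint and the eight fixed size types named earlier in the thread, and it uses the hypothetical src_to_dst naming pattern; neither is a settled decision.

```python
from itertools import permutations

# Assumed set of integer types: the two word-sized ones plus eight fixed size ones.
INTEGER_TYPES = [
    "int", "uint",
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
]

# One coercion per ordered pair of distinct types: N * (N - 1) in total.
coercions = [f"{src}_to_{dst}" for src, dst in permutations(INTEGER_TYPES, 2)]

print(len(coercions))                # 10 * 9 = 90
print("int8_to_int32" in coercions)  # True
```

Providing both a to and a from spelling for every pair would double the 90 to 180, which is the further multiplication by two asked about above.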

Post by Zoltan Somogyi
I would instead suggest that we keep just the existing
integer and big_integer functors, and add a new argument to both.
This argument would say int vs uint, and 8 vs 16 vs 32 vs 64 vs
default size, *purely on the basis of the suffix, without any check
in the scanner*, for the reason given above.
To allow the underscore check mentioned above, the existing argument
of the integer and big_integer functors would need to be a string,
with the conversion done in the compiler. However, doing that
would erase the need for the big_integer functor, since the integer
functor would then be able to represent everything it can.

I prefer the second scheme.

Do you mean the one in the paragraph that starts with “To allow the
underscore …”?

Ok.

Ok to which part of the paragraph? Switching to
“int_const(integer, signedness, maybe(size))”?

Post by Julien Fischer
Do you have a preference as to the type of the second operand of the
shift operations (point 6 in my original post)?

Not yet; I will have to think more about that.

Zoltan.

Julien Fischer

2016-11-01 22:51:27 UTC

Hi Zoltan,

My use of “require” was the wrong word. As I thought my next paragraph
made clear, I was thinking only about a warning, and only if the
programmer explicitly *asked* for it.

I don't think that's something the compiler should be concerned with
(except possibly in the formatting of error messages).

What is the “that” the compiler shouldn’t be concerned with?
Generating warnings, or deleting the _s?

By "that" I mean generating the warnings.

...

Post by Julien Fischer
Yes. The existing numeric types (int, float, rational, integer) already
define these sorts of coercions; with the new types there's just going
to be a lot more of them.

With N integer types, there will need to be N*(N-1) coercion functions.
I guess N is just low enough for that to be manageable.
Do you propose to avoid a further multiplication by two by allowing
just one, not both, of e.g. int8_to_int32 and int32_from_int8?

In short, yes. Looking at the current standard library, the whole issue
of what coercion functions we provide and where they live is a bit more
complicated than I initially thought and probably best discussed in a
separate thread (and in the presence of a concrete proposal, which
I will write up).

I prefer the second scheme.

Do you mean the one in the paragraph that starts with “To allow the
underscore …”?

Yes, I mean the proposal that stores the number as a string and erases
the need for the big_integer functor.