Discussion:
[m-dev.] I want to move and rename the dependency_graph module
Paul Bone
2017-02-06 06:06:13 UTC
Permalink
compiler/dependency_graph.m has types and routines used to build and use
a dependency graph of the HLDS. I want to add the same for the MLDS. To
avoid confusion I want to rename this module hlds.dependency_graph.m so I
also have mlds.dependency_graph.m. I also plan to extract the common code
and probably put it in either dependency_graph.m or libs.dependency_graph.m.

The current dependency_graph module is part of the transform_hlds module. I
would also like to move it to the hlds module. I think it makes more sense
there since it's utility code rather than a transformation.

So. I have:

dependency_graph.m:
Move this from the transform_hlds to the hlds parent module.

Rename it hlds.dependency_graph.m

mlds.dependency_graph.m:
A new module.

dependency_graph.m or
libs.dependency_graph.m:
The code common to the two above modules, plus some code from
hlds_module which also belongs here. Calling it dependency_graph.m is
okay, because it's generic. But it could confuse people looking for
hlds.dependency_graph.m. So I should probably call it
libs.dependency_graph.m

Does anyone see any problems with this? Particularly with moving
dependency_graph from transform_hlds to hlds?

Thanks.
--
Paul Bone
http://paul.bone.id.au
Julien Fischer
2017-02-06 11:53:44 UTC
Permalink
Hi Paul,
Post by Paul Bone
compiler/dependency_graph.m has types and routines used to build and use
a dependency graph of the HLDS. I want to add the same for the MLDS. To
avoid confusion I want to rename this module hlds.dependency_graph.m so I
also have mlds.dependency_graph.m
In keeping with the names of existing modules that define the HLDS and
MLDS respectively I would name them hlds_dependency_graph.m and
ml_dependency_graph.m.
Post by Paul Bone
I also plan to extract the common code and probably put it in either
dependency_graph.m or libs.dependency_graph.m
What code will end up in common? Would any of the common code be better
placed in library/digraph.m?
Post by Paul Bone
The current dependency_graph module is part of the transform_hlds module. I
would also like to move it to the hlds module. I think it makes more sense
there since it's utility code rather than a transformation.
There is some sense in that as it's imported by other packages in the
compiler, notably top_level and check_hlds.
Post by Paul Bone
Move this from the transform_hlds to the hlds parent module.
Rename it hlds.dependency_graph.m
A new module.
dependency_graph.m or
The code common to the two above modules, plus some code from
hlds_module which also belongs here. Calling it dependency_graph.m is
okay, because it's generic. But it could confuse people looking for
hlds.dependency_graph.m. So I should probably call it
libs.dependency_graph.m
Does anyone see any problems with this?
Other than that I don't like the module / filenames you've picked, no.

Julien.
Paul Bone
2017-02-07 00:59:03 UTC
Permalink
Post by Julien Fischer
Hi Paul,
Post by Paul Bone
compiler/dependency_graph.m has types and routines used to build and use
a dependency graph of the HLDS. I want to add the same for the MLDS. To
avoid confusion I want to rename this module hlds.dependency_graph.m so I
also have mlds.dependency_graph.m
In keeping with the names of existing modules that define the HLDS and
MLDS respectively I would name them hlds_dependency_graph.m and
ml_dependency_graph.m.
Okay.
Post by Julien Fischer
Post by Paul Bone
I also plan to extract the common code and probably put it in either
dependency_graph.m or libs.dependency_graph.m
What code will end up in common? Would any of the common code be better
placed in library/digraph.m?
Some of the types.

Perhaps some of it can go in the library. I'll check this as I go.
Post by Julien Fischer
Post by Paul Bone
The current dependency_graph module is part of the transform_hlds module. I
would also like to move it to the hlds module. I think it makes more sense
there since it's utility code rather than a transformation.
There is some sense in that as it's imported by other packages in the
compiler, notably top_level and check_hlds.
That's what I thought. There are a few cases where I've been able to
remove the dependency on the entire transform_hlds module.

Cheers.
--
Paul Bone
http://paul.bone.id.au
Zoltan Somogyi
2017-02-10 02:05:59 UTC
Permalink
Paul, I assume that you want dependency graphs for the MLDS
so that you can implement tail recursion for mutually recursive tail calls.
Is this correct?

When you first proposed that a bit more than a year ago, you didn't give us
much detail about how you planned to do it. Perhaps you could do so now.
It is always better to agree on the design approach before coding.

Are you targeting all MLDS backends, or just C? And what mechanism
do you intend to use for parameter passing, and for telling the trampoline
which member of the clique, if any, to call next? There is more than one
possible choice for both those questions; which combinations have you
tested for performance? You said then you had a script to help you explore
the issue; could you send it to us?

Zoltan.
Paul Bone
2017-02-13 06:10:20 UTC
Permalink
Post by Zoltan Somogyi
Paul, I assume that you want dependency graphs for the MLDS
so that you can implement tail recursion for mutually recursive tail calls.
Is this correct?
Yes. First I'm improving the warnings for non tail-recursive code. Then
I'll work on implementing tail recursion for mutually recursive calls.
Post by Zoltan Somogyi
When you first proposed that a bit more than a year ago, you didn't give us
much detail about how you planned to do it. Perhaps you could do so now.
It is always better to agree on the design approach before coding.
I hadn't yet planned how to do it. It's fairer to say that I was planning
how to plan to do it ;-) There are a number of things (YesLogic has had
some internal discussions) that we think are worth trying.
Post by Zoltan Somogyi
Are you targeting all MLDS backends, or just C? And what mechanism
do you intend to use for parameter passing, and for telling the trampoline
which member of the clique, if any, to call next? There is more than one
possible choice for both those questions; which combinations have you
tested for performance? You said then you had a script to help you explore
the issue; could you send it to us?
I'm aiming at high-level C in particular, but some things will work on most
MLDS backends.

My scripts were nothing fancy; they just ran and timed the programs that I
had prepared. I modified the generated code by hand to create the cases I
needed.

There are three approaches worth trying and their combinations.

+ Inlining
+ We'll do it
+ Let the C compiler do it.

Inlining
--------

Inlining can be used to remove mutual calls or reduce SCCs.

A -> B <-> C
|    |
V    V
D    E

C can probably be inlined into B to remove the mutual recursion; the call to
B within C becomes a self-recursive call. I haven't yet thought through all
the cases where this will or won't work, or is mediocre. I also want to
read the HLDS inlining code to get a sense for what can be done easily.

We'll do it
-----------

This is the optimisation that we discussed last year. So far I had
imagined creating a struct, probably on the stack of the call that enters
the SCC. The struct is passed by reference between all the members of the
SCC. The struct's fields are the arguments passed between the clique's
members (those in tail position) and their uses may overlap. The trampoline
would be the simple loop as shown in the Dr Dobbs article that cites Fergus
and yourself.

Func* fp = entry;
while (fp != NULL) {
    fp = (Func*) (*fp)();
}

However if it is to pass the struct of parameters it would look a little
different.

Func* fp = entry;
ParamStruct s;

s.param1 = param1;
s.param2 = param2;

while (fp != NULL) {
    fp = (Func*) (*fp)(&s);
}

Outputs can also be retrieved from this struct.

I imagined using this type of trampoline and struct because it seems to be
the most straightforward, and therefore the most likely to succeed. So far my
testing was to establish that handling mutual recursion would provide some
benefit, not how best to handle it, so I didn't test any other methods.
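To make the struct-plus-trampoline shape concrete, here is a minimal,
self-contained C sketch. The is_even/is_odd clique, the ParamStruct
fields, and all the function names are invented for illustration; none of
this is taken from the Mercury compiler's actual generated code.

```c
#include <stddef.h>

/* Hypothetical two-procedure clique (is_even / is_odd) rewritten in the
 * trampoline style sketched above. The ParamStruct holds the clique's
 * input argument and its output; each member does one step of work and
 * returns the next member to call, or NULL when the result is ready. */
typedef struct {
    int n;       /* the (shared) input argument */
    int result;  /* the output, retrieved from the struct afterwards */
} ParamStruct;

typedef void *(*Func)(ParamStruct *);

static void *is_odd_step(ParamStruct *s);

static void *is_even_step(ParamStruct *s)
{
    if (s->n == 0) {
        s->result = 1;              /* done: n is even */
        return NULL;
    }
    s->n -= 1;                      /* "tail call" is_odd(n - 1) */
    return (void *) is_odd_step;
}

static void *is_odd_step(ParamStruct *s)
{
    if (s->n == 0) {
        s->result = 0;              /* done: n is not odd */
        return NULL;
    }
    s->n -= 1;                      /* "tail call" is_even(n - 1) */
    return (void *) is_even_step;
}

/* The trampoline itself: constant stack space however deep the
 * mutual recursion would otherwise have been. */
int is_even(int n)
{
    ParamStruct s;
    Func fp = is_even_step;

    s.n = n;
    while (fp != NULL) {
        fp = (Func) (*fp)(&s);
    }
    return s.result;
}
```

Note that casting between function and object pointers, as in the sketch
above, is not strictly conforming ISO C (though it works on the usual
platforms); a second function-pointer typedef would avoid the cast.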

One thing that occurs to me now is to pass the arguments normally. We would
have to pass _all_ the arguments of the members to _each_ of the members.
This would allow the C compiler to decide where to place them (such as in
registers). Return parameters would also have to be handled. (Hrm, even in
normal code we could optimize in/out pairs by letting them share a single
C parameter.)

Another possibility for the trampoline is to return a token (a member of an
enum) and switch on it to decide which function to execute next. We can
pass a more precise parameter list in this case. This gives the C compiler
more control of parameter passing, at the cost of a little more indirection
in the trampoline.

Finally, all the members of the SCC could be placed in the same function
body, using goto statements to handle tail calls. This is likely to be
very fast, but it only works for SCCs with a single entry point and no
recursive non-tail calls that aren't also that entry point, at least without
code duplication.
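As a hypothetical illustration of that last scheme (the is_even/is_odd
pair is invented, not from the compiler), the merged single-function form
with gotos might look like this:

```c
/* Two mutually tail-recursive procedures merged into one C function,
 * with gotos standing in for the tail calls. Constant stack space,
 * but note the single entry point (is_even_entry). */
int is_even_merged(int n)
{
is_even_entry:
    if (n == 0) {
        return 1;
    }
    n = n - 1;
    goto is_odd_entry;          /* tail call is_odd(n - 1) */

is_odd_entry:
    if (n == 0) {
        return 0;
    }
    n = n - 1;
    goto is_even_entry;         /* tail call is_even(n - 1) */
}
```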

I don't know of any other solutions that are also compatible with whichever
revision of the C standard we use (I forget and can't easily find this
information). Let me know if there's an option that I've missed.

Let the C compiler do it
------------------------

Third, it seems that if things are "just right", compilers like GCC / Clang
can do the tail recursion themselves. In practice "just right" means that
the parameter lists and return values are the same. We could either use the
struct from above, or simply make the parameter lists match by adding unused
arguments as necessary. This may share code with the above idea, simply
disabling the use of a trampoline when we know the C compiler is one that
will optimise the mutual recursion itself.

It's very likely that we'll be using a C compiler that supports this, so
it's a good option even if it's not portable.

If we're able to place all the procedures in a single C function with gotos
for tailcalls then that may be better than letting the C compiler implement
the tailcalls, we'd need to test.

Thanks.
--
Paul Bone
http://paul.bone.id.au
Zoltan Somogyi
2017-02-14 05:58:05 UTC
Permalink
Post by Paul Bone
I hadn't yet planned how to do it. It's fairer to say that I was planning
how to plan to do it ;-) There are a number of things that we (YesLogic has
had some discussions) think are worth trying.
My original question was about what designs for ParamStruct you have explored,
but it seems I have gone further on this than you have. A brain dump follows,
interleaved with relevant parts of your message.

All the following assumes model_det code; for model_semi, what I present below
would need tweaks, while for model_non, the whole discussion is moot.
Post by Paul Bone
There are three approaches worth trying and their combinations.
+ Inlining
+ We'll do it
+ Let the C compiler do it.
In terms of implementation, the "inlining" approach is completely
independent of the others.

Let's say an SCC contains n procedures, P1 through Pn.
If the set of tail recursive calls in the SCC is {P1 -> P2, P2 -> P3, ...
Pn-1 -> Pn, Pn -> P1}, i.e. each calls the next one and the last one
calls the first, then inlining is clearly what we want to do. For each
Pi that is called from above the SCC, we would inline the callee at every
tail recursive call site except the one that calls Pi itself. This will give
both the Mercury compiler and later the target language compiler the
best chance to optimize the code. (Any recursive calls that are not TAIL
recursive would be left alone.)

If the number of entry points to the SCC is Ne, this will yield Ne copies
of the code of every procedure in the SCC. However, I think we can handle that.

First, if Ne is 2 or 3 or even 4, the code size increase is probably a price
most people would be willing to pay for reducing stack usage from linear
in the depth of the tail recursion to constant. Second, if Ne is so high
that the user would not want to pay the code size cost, it is trivial to
add a limit: don't do this if Ne exceeds a configurable threshold.
Third, in a nontrivial number of cases Ne will in fact be just one. This is
because programmers may split the code executed in each iteration of a loop
into more than one piece when its length, complexity and/or its indentation
level becomes too much.
Post by Paul Bone
I also want to
read the HLDS inlining code to get a sense for what can be done easily.
Inlining.m is missing other good inlining heuristics as well;
it has just the most basic ones. Don't take its existing structure
as sacrosanct.
Post by Paul Bone
We'll do it
-----------
This is the optimisation that we discussed last year.
...
The struct is passed by reference between all the members of the
SCC.
The trampoline handles only TAIL recursive calls, so that if e.g.
Pk is not called by a TAIL call in the SCC, then (a) ParamStruct will never
need to handle the parameters of Pk, and likewise (b) fp will never point
to Pk.

Let's call the set of Pi that have TAIL calls to them in the SCC the TSCC.
The rest of this email restricts its attention to the TSCC.

By definition, all the Pi in the TSCC have the same vector of output arguments
in terms of type and meaning, although the names of the variables representing
the same argument may differ between procedures. However, this is not true
for their input arguments.

Suppose the input arguments of Pi are PiI1 ... PiImi. The simplest design
for ParamStruct is something that corresponds to this C type,
where e.g. P1I1 stands for the type and the name of the first input arg
of the first proc in the TSCC.

struct {
    P1I1,
    P1I2,
    ...
    P1Im1,

    P2I1,
    P2I2,
    ...
    P2Im2,

    ...

    PnI1,
    PnI2,
    ...
    PnImn
}

(I am ignoring the outputs, but they would just be additional fields here.)

This is what you proposed, and it should work in all of our current MLDS
target languages.

The next simplest is this type:

union {
    struct {
        P1I1,
        P1I2,
        ...
        P1Im1
    },
    struct {
        P2I1,
        P2I2,
        ...
        P2Im2
    },
    ...
    struct {
        PnI1,
        PnI2,
        ...
        PnImn
    }
}

This would have smaller stack usage when targeting C, but would not work
for Java; I don't know about C#. Also, it would need extension to the MLDS
itself; it already has init_struct, but it doesn't have init_union. Also,
I don't believe the MLDS has any real support for structs on the stack,
though it definitely does have support for structs on the heap and in
read-only memory. This is because it always treats the variables it stores
in stack frames individually, not as a group.

When I thought about that, I realized that the trampoline loop does NOT
need to look like this:
Post by Paul Bone
ParamStruct s;
...
while (fp != NULL) {
    fp = (Func*) (*fp)(&s);
}
It can look like this:

// PiIj for all i and j, NOT in a struct
while (p_to_call != 0) {
    switch (p_to_call) {
    case 1:
        // the code of P1, in which each tail call (to e.g. Pk)
        // is replaced with
        //
        //     assignments to PkI1, ... PkImk;
        //     an assignment of k to p_to_call, and
        //     a continue
        p_to_call = 0;
        continue;
    case 2:
        // Likewise for the other procedures in the TSCC
        ...
    }
}
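As a concrete (and entirely hypothetical) rendering of this scheme, again
using an invented is_even/is_odd TSCC with one input argument per member:

```c
/* p_to_call selects the TSCC member to run next: 1 = is_even,
 * 2 = is_odd, 0 = done. The PiIj variables are plain locals rather
 * than struct fields, so the C compiler is free to place them in
 * registers or share their storage. */
int is_even_tscc(int n)
{
    int p_to_call = 1;  /* enter via is_even */
    int p1_n = n;       /* P1I1: is_even's input */
    int p2_n = 0;       /* P2I1: is_odd's input */
    int result = 0;     /* the shared output */

    while (p_to_call != 0) {
        switch (p_to_call) {
        case 1:
            /* body of is_even */
            if (p1_n == 0) {
                result = 1;
                p_to_call = 0;
                continue;
            }
            p2_n = p1_n - 1;    /* tail call is_odd(n - 1) */
            p_to_call = 2;
            continue;
        case 2:
            /* body of is_odd */
            if (p2_n == 0) {
                result = 0;
                p_to_call = 0;
                continue;
            }
            p1_n = p2_n - 1;    /* tail call is_even(n - 1) */
            p_to_call = 1;
            continue;
        }
    }
    return result;
}
```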

To achieve this, the MLDS backend would have to do code generation SCC by SCC.
For each SCC, it would need to know what TSCCs, if any, exist inside it.
It would do this by using the same algorithm to find SCCs as we already use,
but this time, using as edges only the calls in the SCC that are both
RECURSIVE calls and TAIL calls. This will partition the procedures in the SCC
into one or more TSCCs. (While all the procedures in the SCC are by definition
reachable from all other procedures in the SCC, they need not be reachable
via TAIL calls from all other procedures in the SCC.)

For TSCCs that contain only one procedure, we would translate that procedure
the usual way. For TSCCs that contain two or more procedures, we would want
to generate code that follows the scheme above.

For this, we would want a generalized version of the existing ml_gen_proc
in the compiler, which can be told that the procedure is part of a TSCC.
For such procedures, ml_proc_gen would need to put something into the
ml_gen_info that causes ml_call_gen to follow the template above for
tail recursive calls. It would also need to generate target language
variable names that have distinguishing prefix or suffix, so that
they won't clash with the names of target variables generated for
possibly-identically named variables in other procs in the TSCC.

The code generated for a procedure this way would be the body of one of
the switch arms above. We would need to generate the switch itself,
and everything that goes with it, for each entry point in the TSCC.
(Note that a procedure can be an entry point of a TSCC *without* being
an entry point of the SCC that contains the TSCC.) If the TSCC has just
one entry point, that is good. If it has more than one entry point,
this would mean code duplication, and the only difference between the copies
would be the code before the switch that initializes both p_to_call and
the input arguments of the procedure to call. It should be possible
to avoid this by generating a procedure whose argument list is

- all of the PiIj, and
- p_to_call.

Then for each entry point, we could just call this TSCC procedure
specifying the id of the procedure as p_to_call, the actual values
of its input arguments, and dummy values for the input args of all
the other procs in the TSCC.

However, I am not sure about how easy it would be to generate dummy
arguments that would be type correct in each target language.
I also think that duplicating the switch once, and maybe twice,
would be preferable performance-wise to having to pass the dummy arguments
in the first place.

The setting of p_to_call to a value, followed by a continue that
goes immediately to a switch on p_to_call, is effectively a goto
to the switch arm selected by the value assigned to p_to_call.
I hope that most target language compilers, including gcc and clang,
would recognize this fact, effectively yielding the code you proposed.
However, letting that target language compiler do this would let us avoid
including the concepts of labels and gotos in the MLDS. (In fact, I think
that the absence of those concepts from the MLDS is one of the important
things that differentiates the MLDS from the LLDS.)
Post by Paul Bone
Outputs can also be retrieved from this struct.
I imagined using this type of trampoline and struct because it seems to be
the most straightforward, therefore the most likely to succeed. So far my
testing was to establish that handling mutual recursion would provide some
benefit, and not how to handle it best, so I didn't test any other methods.
One thing that occurs to me now is to pass the arguments normally. We would
have to pass _all_ the arguments of the members to _each_ of the members.
As I mention above, I am not sure that would be easy.
Post by Paul Bone
This would allow the C compiler to decide where to place them (such as in
registers). Return parameters would also have to be handled. (Hrm, even in
normal code we could optimize in/out pairs by letting them share a single
C parameter.)
See below.
Post by Paul Bone
Another possibility for the trampoline is to return a token (a member of an
enum) and switch on it to decide which function to execute next. We can
pass a more precise parameter list in this case. This gives the C compiler
more control of parameter passing, at the cost of a little more indirection
in the trampoline.
Finally all the members of the SCC could be placed in the same function
body, and use goto statements to handle tail-calls. This is likely to be
very fast, but it only works for SCCs with a single entry point and no
recursive non-tail calls that aren't also that entry point, at least without
code duplication.
See above for how this should be doable at reasonable programming cost.
Post by Paul Bone
Third, it seems that if things are "just right" compilers like GCC / Clang
can do the tail recursion themselves. In practice "just right" means that
the parameter lists and return values are the same.
I think that approach would be pretty fragile. We would want to give
guarantees about which tail calls consume stack and which don't; I don't
think we can subcontract such guarantees to the target language compiler.
Post by Paul Bone
We could either use the
struct from above, or simply make the parameter lists match by adding unused
arguments as necessary.
The latter is unnecessarily slow.
Post by Paul Bone
This may share code with the above idea, simply
disabling the use of a trampoline when we know the C compiler matches one
that we know will do the mutual recursion.
That test would be a bitch to keep up to date.
Post by Paul Bone
If we're able to place all the procedures in a single C function with gotos
for tailcalls then that may be better than letting the C compiler implement
the tailcalls, we'd need to test.
If the best that you can hope for from having the C compiler optimize tail calls
is that the optimization transforms code that uses separate functions into
code that replaces the tail calls with gotos in a merged function body,
then generating separate functions cannot yield faster code; the only reason
for preferring that approach would be easier coding on our part. I think
the approach I proposed above should be easy enough to implement.

------------------------------

Another issue is that mutually recursive procedures often have one or more
arguments that are passed to them from above the SCC and which are passed down
the chain of recursive calls. In some cases, the argument is always passed
unchanged; in some other cases, it is sometimes passed unchanged, and
sometimes passed updated. In both cases, we should avoid the need for
every tail recursive call to set the value of these parameters, because
that is wasted work in the common case that the value is passed along
unchanged.

With the design above, such parameters could be stored in e.g. P1I1, P2I1, and
P3I2. The target language compiler may find out that it can store all these
variables in the same stack slot, making assignments such as P3I2 = P2I1
just before a tail call from P2 to P3 a no-operation. If some target language
compilers don't, then the Mercury compiler could itself compute which
sets of input arguments, one argument from each procedure in the TSCC, form
a single parameter in this sense. We could then get ml_gen_proc to refer to that
input argument by the shared name, and ml_call_gen to optimize its passing.

------------------------------

The above approach places completely different demands on the dependency
graph module than your approach. First, it means that we DON'T need
dependency graphs for the MLDS, so the splitting of the dependency_graph.m
module isn't needed either. (Though moving it from transform_hlds to hlds
is still a good idea.) Second, we would need *different* changes to
the module: the ability to compute TSCCs, and the ability to compute
entry points for both SCCs and TSCCs. Third, we would want to know
the structure of the tail recursive calls inside an SCC: for example,
are they linear?

This last point brings the discussion full circle, back to the start of the thread.

Zoltan.
Paul Bone
2017-02-15 02:08:47 UTC
Permalink
Post by Zoltan Somogyi
Post by Paul Bone
I hadn't yet planned how to do it. It's fairer to say that I was planning
how to plan to do it ;-) There are a number of things that we (YesLogic has
had some discussions) think are worth trying.
My original question was about what designs for ParamStruct you have explored,
but it seems I have gone further on this than you have. A brain dump follows,
interleaved with relevant parts of your message.
All the following assumes model_det code; for model_semi, what I present below
would need tweaks, while for model_non, the whole discussion is moot.
Post by Paul Bone
There are three approaches worth trying and their combinations.
+ Inlining
+ We'll do it
+ Let the C compiler do it.
In terms of implementation, the "inlining" approach is completely
independent of the others.
Let's say an SCC contains n procedures, P1 through Pn.
If the set of tail recursive calls in the SCC is {P1 -> P2, P2 -> P3, ...
Pn-1 -> Pn, Pn -> P1}, i.e. each calls the next one and the last one
calls the first, then inlining is clearly what we want to do. For each
Pi that is called from above the SCC, we would inline the callee at every
tail recursive call site except the one that calls Pi itself. This will give
both the Mercury compiler and later the target language compiler the
best chance to optimize the code. (Any recursive calls that are not TAIL
recursive would be left alone.)
If the number of entry points to the SCC is Ne, this will yield Ne copies
of the code of every procedure in the SCC. However, I think we can handle that.
We have to include any recursive calls that are left alone in the number Ne
(for this purpose). So it could be higher.
Post by Zoltan Somogyi
First, if Ne is 2 or 3 or even 4, the code size increase is probably a price
most people would be willing to pay for reducing stack usage from linear
in the depth of the tail recursion to constant. Second, if Ne is so high
that the user would not want to pay the code size cost, it is trivial to
add a limit: don't do this if Ne exceeds a configurable threshold.
Third, in a nontrivial number of cases Ne will in fact be just one. This is
because programmers may split the code executed in each iteration of a loop
into more than one piece when its length, complexity and/or its indentation
level becomes too much.
I wanted to think about in what cases inlining non-tail calls would also
help.

a(...) :-
    ...,
    b(...).

b(...) :-
    ...,
    c(...),
    ....

c(...) :-
    ...,
    a(...).

Inlining c into b in this situation won't help straight away: the call to a
in c (now in b) is still not in tail position.

Inlining a into c, and b into a, will help; it'll still use linear stack
space, but not as much as before. But it requires that b is either the entry
point of the SCC, or that the path(s) through the SCC towards b are
duplicated.

Regarding code duplication. We would also want to consider inlining in
branches:

a(x, ...) :-
    ...,
    b(...).
a(y, ...) :-
    ...,
    b(...).

b(...) :-
    ...,
    a(...).

Inlining b into a will create more code. This is also something we'd need
to factor into inlining decisions.

I would propose a cost model, except that we cannot hope to (analytically)
know the cost of code duplication vs the cost of non-tail calls. However I
think it would be useful to create and agree upon some kind of model. Even
if we can't compare code duplication and non-tail calls, we can calculate
code duplication.
Post by Zoltan Somogyi
Post by Paul Bone
I also want to
read the HLDS inlining code to get a sense for what can be done easily.
Inlining.m is missing other good inlining heuristics as well;
it has just the most basic ones. Don't take its existing structure
as sacrosanct.
Yep, only insofar as what's easy to add without too much effort.
Post by Zoltan Somogyi
Post by Paul Bone
We'll do it
-----------
This is the optimisation that we discussed last year.
...
The struct is passed by reference between all the members of the
SCC.
The trampoline handles only TAIL recursive calls, so that if e.g.
Pk is not called by a TAIL call in the SCC, then (a) ParamStruct will never
need to handle the parameters of Pk, and likewise (b) fp will never point
to Pk.
True. My bad, I wasn't being precise.
Post by Zoltan Somogyi
Let's call the set of Pi that have TAIL calls to them in the SCC the TSCC.
The rest of this email restricts its attention to the TSCC.
By definition, all the Pi in the TSCC have the same vector of output arguments
in terms of type and meaning, although the names of the variables representing
the same argument may differ between procedures. However, this is not true
for their input arguments.
Suppose the input arguments of Pi are PiI1 ... PiImi. The simplest design
for ParamStruct is something that corresponds to this C type,
where e.g. P1I1 stands for the type and the name of the first input arg
of the first proc in the TSCC.
struct {
    P1I1,
    P1I2,
    ...
    P1Im1,
    P2I1,
    P2I2,
    ...
    P2Im2,
    ...
    PnI1,
    PnI2,
    ...
    PnImn
}
(I am ignoring the outputs, but they would just be additional fields here.)
This is what you proposed, and it should work in all of our current MLDS
target languages.
I was thinking of something that is the super-set of the PiI1..PiImi sets.
Using unions (your suggestion below) would be good when it's supported.
Post by Zoltan Somogyi
union {
    struct {
        P1I1,
        P1I2,
        ...
        P1Im1
    },
    struct {
        P2I1,
        P2I2,
        ...
        P2Im2
    },
    ...
    struct {
        PnI1,
        PnI2,
        ...
        PnImn
    }
}
This would have smaller stack usage when targeting C, but would not work
for Java; I don't know about C#. Also, it would need extension to the MLDS
itself; it already has init_struct, but it doesn't have init_union. Also,
I don't believe the MLDS has any real support for structs on the stack,
though it definitely does have support for structs on the heap and in
read-only memory. This is because it always treats the variables it stores
in stack frames individually, not as a group.
When I thought about that, I realized that the trampoline loop does NOT
need to look like this:
Post by Paul Bone
ParamStruct s;
...
while (fp != NULL) {
    fp = (Func*) (*fp)(&s);
}
// PiIj for all i and j, NOT in a struct
while (p_to_call != 0) {
    switch (p_to_call) {
    case 1:
        // the code of P1, in which each tail call (to e.g. Pk)
        // is replaced with
        //
        //     assignments to PkI1, ... PkImk;
        //     an assignment of k to p_to_call, and
        //     a continue
        p_to_call = 0;
        continue;
    case 2:
        // Likewise for the other procedures in the TSCC
        ...
    }
}
One thought about inlining the TSCC members, rather than calling them in
each switch arm: if they are kept as separate functions, we can let the C
compiler decide whether inlining is worthwhile. It may have more
information than we do to make such decisions.

In either case I like this idea: after inlining, the C compiler has a lot of
freedom in how it places and accesses the variables that were the input
arguments between the calls.
Post by Zoltan Somogyi
To achieve this, the MLDS backend would have to do code generation SCC by SCC.
For each SCC, it would need to know what TSCCs, if any, exist inside it.
It would do this by using the same algorithm to find SCCs as we already use,
but this time, using as edges only the calls in the SCC that are both
RECURSIVE calls and TAIL calls. This will partition the procedures in the SCC
into one or more TSCCs. (While all the procedures in the SCC are by definition
reachable from all other procedures in the SCC, they need not be reachable
via TAIL calls from all other procedures in the SCC.)
For TSCCs that contain only one procedure, we would translate that procedure
the usual way. For TSCCs that contain two or more procedures, we would want
to generate code that follows the scheme above.
Would an MLDS->MLDS transformation be easier?
Post by Zoltan Somogyi
For this, we would want a generalized version of the existing ml_gen_proc
in the compiler, which can be told that the procedure is part of a TSCC.
For such procedures, ml_proc_gen would need to put something into the
ml_gen_info that causes ml_call_gen to follow the template above for
tail recursive calls. It would also need to generate target language
variable names that have distinguishing prefix or suffix, so that
they won't clash with the names of target variables generated for
possibly-identically named variables in other procs in the TSCC.
The code generated for a procedure this way would be the body of one of
the switch arms above. We would need to generate the switch itself,
and everything that goes with it, for each entry point in the TSCC.
(Note that a procedure can be an entry point of a TSCC *without* being
an entry point of the SCC that contains the TSCC.) If the TSCC has just
one entry point, that is good. If it has more than one entry point,
this would mean code duplication, and the only difference between the copies
would be the code before the switch that initializes both p_to_call and
the input arguments of the procedure to call. It should be possible
to avoid this by generating a procedure whose argument list is
- all of the PiIj, and
- p_to_call.
Then for each entry point, we could just call this TSCC procedure
specifying the id of the procedure as p_to_call, the actual values
of its input arguments, and dummy values for the input args of all
the other procs in the TSCC.
However, I am not sure about how easy it would be to generate dummy
arguments that would be type correct in each target language.
Java and C# both have null which can be used for objects. I think we
understand the possible primitive types (like int) and can use values such
as 0.

Can a user give a foreign_type pragma with a primitive Java/C# type? I
don't think that would interact with Mercury's polymorphism well so it's
probably not supported.
Post by Zoltan Somogyi
I also think that duplicating the switch once, and maybe twice,
would be preferable performance-wise to having to pass the dummy arguments
in the first place.
Yes, it definitely gives the C/Java/C# compiler more freedom.
Post by Zoltan Somogyi
The setting of p_to_call to a value, followed by a continue that
goes immediately to a switch on p_to_call, is effectively a goto
to the switch arm selected by the value assigned to p_to_call.
I hope that most target language compilers, including gcc and clang,
would recognize this fact, effectively yielding the code you proposed.
However, letting that target language compiler do this would let us avoid
including the concepts of labels and gotos in the MLDS. (In fact, I think
that the absence of those concepts from the MLDS is one of the important
things that differentiates the MLDS from the LLDS.)
We already have goto and labels in the MLDS. Using them rather than a
switch is probably okay. I don't know what the backend compilers will
optimize WRT switches. Using a goto to break out of the loop would also be
more direct: it allows us to use an infinite loop and avoid the branch
associated with the loop condition.

... Snip ...
Post by Zoltan Somogyi
Post by Paul Bone
Third, it seems that if things are "just right" compilers like GCC / Clang
can do the tail recursion themselves. In practice "just right" means that
the parameter lists and return values are the same.
I think that approach would be pretty fragile. We would want to give guarantees
about what tail calls consume stack and what calls don't; I don't think we can
subcontract such guarantees to the target language compiler.
Post by Paul Bone
We could either use the
struct from above, or simply make the parameter lists match by adding unused
arguments as necessary.
The latter is unnecessarily slow.
Post by Paul Bone
This may share code with the above idea, simply
disabling the use of a trampoline when we know the C compiler matches one
that we know will do the mutual recursion.
That test would be a bitch to keep up to date.
I don't think it would be that hard. However, I prefer the above solution
anyway; I think it will (usually) lead to better code/performance.

If the above solution would create too much code duplication, particularly
when we can't get constant stack space due to non-tail recursive calls,
then I think we should fall back to either the trampoline, or letting the C
compiler do it.
Post by Zoltan Somogyi
Post by Paul Bone
If we're able to place all the procedures in a single C function with gotos
for tailcalls then that may be better than letting the C compiler implement
the tailcalls; we'd need to test.
If the best that you can hope for from having the C compiler optimize tail calls
is that the optimization transforms code that uses separate functions into
code that replaces the tail calls with gotos in a merged function body,
then generating separate functions cannot yield faster code; the only reason
for preferring that approach would be easier coding on our part. I think
the approach I proposed above should be easy enough to implement.
I agree; you're probably right that that is all the C compiler attempts to
do, since most C code won't depend on LCO. If (big IF) it implemented
proper LCO then this would be different. But then it wouldn't require
the argument lists to match anyway.
Post by Zoltan Somogyi
------------------------------
Another issue is that mutually recursive procedures often have one or more
arguments that are passed to them from above the SCC and which are passed down
the chain of recursive calls. In some cases, the argument is always passed
unchanged; in some other cases, it is sometimes passed unchanged, and
sometimes passed updated. In both cases, we should avoid the need for
every tail recursive call to set the value of these parameters, because
that is wasted work in the common case that the value is passed along
unchanged.
With the design above, such parameters could be stored in e.g. P1I1, P2I1, and
P3I2. The target language compiler may find out that it can store all these
variables in the same stack slot, making assignments such as P3I2 = P2I1
just before a tail call from P2 to P3 a no-operation. If some target language
compilers don't, then the Mercury compiler could itself compute which
sets of input arguments, one from each procedure in the TSCC, form a single
parameter in this sense. We could then get ml_gen_proc to refer to that
input argument by the shared name, and ml_call_gen to optimize its passing.
Implementing this in the MLDS as a separate excess assignment elimination
wouldn't be too difficult.
Post by Zoltan Somogyi
------------------------------
The above approach places completely different demands on the dependency
graph module than your approach. First, it means that we DON'T need
dependency graphs for the MLDS, so the splitting of the dependency_graph.m
module isn't needed either. (Though moving it from transform_hlds to hlds
is still a good idea.) Second, we would need *different* changes to
the module: the ability to compute TSCCs, and the ability to compute
entry points for both SCCs and TSCCs. Third, we would want to know
the structure of the tail recursive calls inside an SCC: for example,
are they linear?
I have a version of ml_dependency_graph.m in my workspace. It can be
modified to add support for TSCCs. My big question at this point is whether
we'd prefer to do this as an MLDS->MLDS transformation.

They have to be linear; there can only be one tail call on any branch.
Other SCC calls make it non-linear, but they're not TSCC calls. Maybe I
misunderstood what you mean by linear.
Post by Zoltan Somogyi
This last point brings the discussion full circle, back to the start of the thread.
:-)
--
Paul Bone
http://paul.bone.id.au
Julien Fischer
2017-02-15 02:18:11 UTC
Permalink
Hi Paul,
Post by Paul Bone
Post by Zoltan Somogyi
Then for each entry point, we could just call this TSCC procedure
specifying the id of the procedure as p_to_call, the actual values
of its input arguments, and dummy values for the input args of all
the other procs in the TSCC.
However, I am not sure about how easy it would be to generate dummy
arguments that would be type correct in each target language.
Java and C# both have null which can be used for objects. I think we
understand the possible primitive types (like int) and can use values such
as 0.
Can a user give a foreign_type pragma with a primitive Java/C# type?
Yes, both backends allow the foreign type to be any accessible Java / C#
type.
Post by Paul Bone
I don't think that would interact with Mercury's polymorphism well so
it's probably not supported.
Both backends already support generating an appropriate initializer for
any mlds_type (and hence Mercury type), since we need to assign values
to local variables in the generated code to "get around" definite
assignment analysis.
(See mlds_to_java.get_java_type_initializer/1 and the similar
predicate in mlds_to_cs.m.)

Julien.
Zoltan Somogyi
2017-02-15 03:15:31 UTC
Permalink
Post by Paul Bone
Post by Zoltan Somogyi
Let's say an SCC contains n procedures, P1 through Pn.
If the set of tail recursive calls in the SCC is {P1 -> P2, P2 -> P3, ...
Pn-1 -> Pn, Pn -> P1}, i.e. each calls the next one and the last one
calls the first, then inlining is clearly what we want to do. For each
Pi that is called from above the SCC, we would inline the callee at every
tail recursive call site except the one that calls Pi itself. This will give
both the Mercury compiler and later the target language compiler the
best chance to optimize the code. (Any recursive calls that are not TAIL
recursive would be left alone.)
If the number of entry points to the SCC is Ne, this will yield Ne copies
of the code of every procedure in the SCC. However, I think we can handle that.
We have to include any recursive calls that are left alone in the number Ne
(for this purpose). So it could be higher.
Yes, you are right. If any member of the SCC is called from the SCC itself
in a non-tail position, that member would need an entry point, even if it is
not called from anywhere in any higher SCC.
Post by Paul Bone
I wanted to think about in what cases inlining non-tail calls would also
help.
When the callee is a recursive predicate, inlining it is rarely a good idea,
outside of special circumstances such as those discussed above.
Post by Paul Bone
Inlining b into a will create more code. This is also something we'd need
to factor into inlining decisions.
The decision you want inlining to make is not "do I inline B into A?", but
rather "do I inline B into *this call site* in A?". It is perfectly possible
to inline B into some call site(s) in A but not into others.
Post by Paul Bone
I would propose a cost model, except that we cannot hope to (analytically)
know the cost of code duplication vs the cost of non-tail calls. However I
think it would be useful to create and agree upon some kind of model. Even
if we can't compare code duplication and non-tail calls, we can calculate
code duplication.
Actually, that is effectively impossible to do accurately in inlining.m by itself.
The reason is that what you *want* to measure is NOT the immediate increase
in code size generated by inlining, but the *final*, overall increase in code size,
and *that* is affected by every optimization that runs after inlining.
That measure is affected by how the particular details of a given call site
affect the optimizations that can be applied to the inlined copy of the callee.
For example, if the callee has a switch on X, and a call site sets the variable
that becomes the input arg X to a given function symbol, later optimizations
can replace the whole switch with just the selected arm. This means that
in some cases, even inlining a procedure with a huge body can lead to an
overall *reduction* in code size, if later optimizations can delete everything
in the callee's body except e.g. a switch arm containing a single unification,
whose code is smaller than the code required to make the call.

Therefore *accurate* computation of the code duplication you end up with
(as opposed to what you *start* with) is possible only if inlining runs
all the later passes of the compiler on the output of each inlining trial.
Post by Paul Bone
I was thinking of something that is the super-set of each PiI1..In set.
The names we generate for MLDS variables include the variable number
of the HLDS variable they represent. Since these will coincide between
procedures only by accident, this is essentially equivalent. However,
just because there is an MLDS variable named ABC_42 in both p and q
does NOT mean that they are bound to even have the same type,
much less any kind of data flow relationship.
Post by Paul Bone
One thought about inlining the TSCC members, rather than calling them in
each switch arm, is that if they are outlined we can let the C compiler
decide if inlining is worthwhile. It may have more information than we do
to make such decisions.
It *may* have more information than we do about whether inlining is
worthwhile, though I doubt it. I am certain, though, that *we* have
more information about whether inlining is *safe*. Since the target language
compilers will always need to be conservative about safety, I don't think
leaving the inlining up to them is a good idea.
Post by Paul Bone
In either case I like this idea: after inlining, the C compiler has a lot of
freedom in how it places and accesses the variables that were the input
arguments between the calls.
Yes.
Post by Paul Bone
Post by Zoltan Somogyi
To achieve this, the MLDS backend would have to do code generation SCC by SCC.
For each SCC, it would need to know what TSCCs, if any, exist inside it.
It would do this by using the same algorithm to find SCCs as we already use,
but this time, using as edges only the calls in the SCC that are both
RECURSIVE calls and TAIL calls. This will partition the procedures in the SCC
into one or more TSCCs. (While all the procedures in the SCC are by definition
reachable from all other procedures in the SCC, they need not be reachable
via TAIL calls from all other procedures in the SCC.)
For TSCCs that contain only one procedure, we would translate that procedure
the usual way. For TSCCs that contain two or more procedures, we would want
to generate code that follows the scheme above.
Would an MLDS->MLDS transformation be easier?
No.

If you have a choice between generating the wrong thing and then fixing it up
or generating the right thing directly, always choose the latter. The code that
does the fixups in the first approach would always be vulnerable to changes
in the code generation itself screwing up the patterns that it looks for;
this could happen even if those changes were optimizations for unrelated
purposes.

I learned this lesson the hard way.
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
This may share code with the above idea, simply
disabling the use of a trampoline when we know the C compiler matches one
that we know will do the mutual recursion.
That test would be a bitch to keep up to date.
I don't think it would be that hard. However I prefer the above solution
anyway, I think it will (usually) lead to better code/performance.
I wasn't thinking about it being hard; I was thinking that it would be *annoying*,
given that it would have to be repeated after every new version of gcc/clang/etc.
Post by Paul Bone
Post by Zoltan Somogyi
Another issue is that mutually recursive procedures often have one or more
arguments that are passed to them from above the SCC and which are passed down
the chain of recursive calls. In some cases, the argument is always passed
unchanged; in some other cases, it is sometimes passed unchanged, and
sometimes passed updated. In both cases, we should avoid the need for
every tail recursive call to set the value of these parameters, because
that is wasted work in the common case that the value is passed along
unchanged.
With the design above, such parameters could be stored in e.g. P1I1, P2I1, and
P3I2. The target language compiler may find out that it can store all these
variables in the same stack slot, making assignments such as P3I2 = P2I1
just before a tail call from P2 to P3 a no-operation. If some target language
compilers don't, then the Mercury compiler could itself compute which
sets of input arguments, one from each procedure in the TSCC, form a single
parameter in this sense. We could then get ml_gen_proc to refer to that
input argument by the shared name, and ml_call_gen to optimize its passing.
Implementing this in the MLDS as a separate excess assignment elimination
wouldn't be too difficult.
The existing excess assignment elimination pass works only on conjunctions;
if it sees that one conjunct is X1 := X0, then it can replace X0 with X1,
or vice versa, in all the later conjuncts. This is relatively simple to do.
In the case we are talking about, the assignment Xq := Xp will appear just before
the tailcall from p to q, implemented as a branch back to the top of the driver
loop, so this won't work. You will have to extend the analysis to q's branch in
the switch, at which point the analysis needed is so different that you may as well
call it by some other name.

When a procedure body updates STATE_VARIABLE_X_0 to become
STATE_VARIABLE_X_1 and then never refers to STATE_VARIABLE_X_0 again,
I trust the target language compilers to reuse the same stack slot for both,
because keeping track of such things is easy for them in code without backward
branches. Tail calls introduce backward branches, making the job much harder,
not just for target language compilers, but for any MLDS analyses as well.

By contrast, things are much simpler at the HLDS level: you just need to match
the input args in the procedure head with those in the input arg list of the
recursive call. If e.g. input arg 2 in the head and input arg 3 in the tail call
have the same type and the same name modulo a numeric suffix, then we
can use the same MLDS var for input arg 2 of this proc and input arg 3 of the
tail call's callee proc.

Obviously, this test can be made more sophisticated, and independent of
variable names, but this simple test will get us a large fraction, probably
well over half, of the improvement that is potentially available from
this kind of optimization.
Post by Paul Bone
I have a version of ml_dependency_graph.m in my workspace. It can be
modified to add support for TSCCs. My big question at this point is whether
we'd prefer to do this as an MLDS->MLDS transformation.
I don't think so. As far as I can see, implementing tail calls in the MLDS
is best done at the HLDS to MLDS stage, not as an MLDS to MLDS transformation.
Post by Paul Bone
They have to be linear; there can only be one tail call on any branch.
Other SCC calls make it non-linear, but they're not TSCC calls. Maybe I
misunderstood what you mean by linear.
I meant that each procedure in the SCC contains just one tail call,
to some member of the SCC other than itself. For example,
if an SCC contains p, q, r and s, then the only tail calls are p->q,
q->r, r->s and s->p. If you add a tail call p->r, then it is not linear.
This kind of linearity test is for the inlining we discussed at the top.

Zoltan.
Paul Bone
2017-02-15 03:58:01 UTC
Permalink
Post by Zoltan Somogyi
Post by Paul Bone
I would propose a cost model, except that we cannot hope to (analytically)
know the cost of code duplication vs the cost of non-tail calls. However I
think it would be useful to create and agree upon some kind of model. Even
if we can't compare code duplication and non-tail calls, we can calculate
code duplication.
Actually, that is effectively impossible to do accurately in inlining.m by itself.
The reason is that what you *want* to measure is NOT the immediate increase
in code size generated by inlining, but the *final*, overall increase in code size,
and *that* is affected by every optimization that runs after inlining.
That measure is affected by how the particular details of a given call site
affect the optimizations that can be applied to the inlined copy of the callee.
For example, if the callee has a switch on X, and a call site sets the variable
that becomes the input arg X to a given function symbol, later optimizations
can replace the whole switch with just the selected arm. This means that
in some cases, even inlining a procedure with a huge body can lead to an
overall *reduction* in code size, if later optimizations can delete everything
in the callee's body except e.g. a switch arm containing a single unification,
whose code is smaller than the code required to make the call.
Therefore *accurate* computation of the code duplication you end up with
(as opposed to what you *start* with) is possible only if inlining runs
all the later passes of the compiler on the output of each inlining trial.
Right, that sounds like an interesting problem. But one I'm not going to
solve any time soon.
Post by Zoltan Somogyi
Post by Paul Bone
I was thinking of something that is the super-set of each PiI1..In set.
The names we generate for MLDS variables include the variable number
of the HLDS variable they represent. Since these will coincide between
procedures only by accident, this is essentially equivalent. However,
just because there is an MLDS variable named ABC_42 in both p and q
does NOT mean that they are bound to even have the same type,
much less any kind of data flow relationship.
Whether two variables can share a field should be decided by their type, not
by their name.

So if I have:

void a(MR_Word a, MR_Word b, MyType *c);

void b(YourType *d, MR_Word e);

Then I can have:

struct ParamStruct1 {
MR_Word a_and_e;
MR_Word b;
MyType *c;
YourType *d;
};

It's only useful if the foreign language doesn't have unions.
Post by Zoltan Somogyi
Post by Paul Bone
One thought about inlining the TSCC members, rather than calling them in
each switch arm, is that if they are outlined we can let the C compiler
decide if inlining is worthwhile. It may have more information than we do
to make such decisions.
It *may* have more information than we do about whether inlining is
worthwhile, though I doubt it. I am certain, though, that *we* have
more information about whether inlining is *safe*. Since the target language
compilers will always need to be conservative about safety, I don't think
leaving the inlining up to them is a good idea.
I was thinking that it would have information about the relative cost of a
C call versus placing code inline.

Except for static variables in C, when is inlining unsafe? I'm now asking
out of curiosity; I think both approaches would work almost as well as
each other.
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
To achieve this, the MLDS backend would have to do code generation SCC by SCC.
For each SCC, it would need to know what TSCCs, if any, exist inside it.
It would do this by using the same algorithm to find SCCs as we already use,
but this time, using as edges only the calls in the SCC that are both
RECURSIVE calls and TAIL calls. This will partition the procedures in the SCC
into one or more TSCCs. (While all the procedures in the SCC are by definition
reachable from all other procedures in the SCC, they need not be reachable
via TAIL calls from all other procedures in the SCC.)
For TSCCs that contain only one procedure, we would translate that procedure
the usual way. For TSCCs that contain two or more procedures, we would want
to generate code that follows the scheme above.
Would an MLDS->MLDS transformation be easier?
No.
If you have a choice between generating the wrong thing and then fixing it up
or generating the right thing directly, always choose the latter. The code that
does the fixups in the first approach would always be vulnerable to changes
in the code generation itself screwing up the patterns that it looks for;
this could happen even if those changes were optimizations for unrelated
purposes.
I learned this lesson the hard way.
The case of detecting mutual recursion seems simple enough. However I don't
have a good reason to choose one over the other, so I'm happy to agree with
you. It's also a good principle.
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
This may share code with the above idea, simply
disabling the use of a trampoline when we know the C compiler matches one
that we know will do the mutual recursion.
That test would be a bitch to keep up to date.
I don't think it would be that hard. However, I prefer the above solution
anyway; I think it will (usually) lead to better code/performance.
I wasn't thinking about it being hard; I was thinking that it would be *annoying*,
given that it would have to be repeated after every new version of gcc/clang/etc.
Yep. I'd like to keep it in mind anyway, it may still be useful as a
fallback option.
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
Another issue is that mutually recursive procedures often have one or more
arguments that are passed to them from above the SCC and which are passed down
the chain of recursive calls. In some cases, the argument is always passed
unchanged; in some other cases, it is sometimes passed unchanged, and
sometimes passed updated. In both cases, we should avoid the need for
every tail recursive call to set the value of these parameters, because
that is wasted work in the common case that the value is passed along
unchanged.
With the design above, such parameters could be stored in e.g. P1I1, P2I1, and
P3I2. The target language compiler may find out that it can store all these
variables in the same stack slot, making assignments such as P3I2 = P2I1
just before a tail call from P2 to P3 a no-operation. If some target language
compilers don't, then the Mercury compiler could itself compute which
sets of input arguments, one from each procedure in the TSCC, form a single
parameter in this sense. We could then get ml_gen_proc to refer to that
input argument by the shared name, and ml_call_gen to optimize its passing.
Implementing this in the MLDS as a separate excess assignment elimination
wouldn't be too difficult.
The existing excess assignment elimination pass works only on conjunctions;
if it sees that one conjunct is X1 := X0, then it can replace X0 with X1,
or vice versa, in all the later conjuncts. This is relatively simple to do.
In the case we are talking about, the assignment Xq := Xp will appear just before
the tailcall from p to q, implemented as a branch back to the top of the driver
loop, so this won't work. You will have to extend the analysis to q's branch in
the switch, at which point the analysis needed is so different that you may as well
call it by some other name.
When a procedure body updates STATE_VARIABLE_X_0 to become
STATE_VARIABLE_X_1 and then never refers to STATE_VARIABLE_X_0 again,
I trust the target language compilers to reuse the same stack slot for both,
because keeping track of such things is easy for them in code without backward
branches. Tail calls introduce backward branches, making the job much harder,
not just for target language compilers, but for any MLDS analyses as well.
By contrast, things are much simpler at the HLDS level: you just need to match
the input args in the procedure head with those in the input arg list of the
recursive call. If e.g. input arg 2 in the head and input arg 3 in the tail call
have the same type and the same name modulo a numeric suffix, then we
can use the same MLDS var for input arg 2 of this proc and input arg 3 of the
tail call's callee proc.
Obviously, this test can be made more sophisticated, and independent of
variable names, but this simple test will get us a large fraction, probably
well over half, of the improvement that is potentially available from
this kind of optimization.
Okay.

I'd like to handle loop invariants explicitly (re the loop that drives the
recursive calls) but starting here makes sense.
Post by Zoltan Somogyi
Post by Paul Bone
I have a version of ml_dependency_graph.m in my workspace. It can be
modified to add support for TSCCs. My big question at this point is whether
we'd prefer to do this as an MLDS->MLDS transformation.
I don't think so. As far as I can see, implementing tail calls in the MLDS
is best done at the HLDS to MLDS stage, not as an MLDS to MLDS transformation.
I agree, in the long term this and most of the other work I've done in my
branch is not useful :-(. However it is useful in the short term. I have
warnings for mutual recursion now 95% working (the UI needs fixing) and I
can use this to survey the Prince and other codebases to find out where the
mutual recursions are and what type they are. We can use this to determine
what part of this to start working on first, eg HLDS inlining or HLDS->MLDS
inlining into a loop.
Post by Zoltan Somogyi
Post by Paul Bone
They have to be linear; there can only be one tail call on any branch.
Other SCC calls make it non-linear, but they're not TSCC calls. Maybe I
misunderstood what you mean by linear.
I meant that each procedure in the SCC contains just one tail call,
to some member of the SCC other than itself. For example,
if an SCC contains p, q, r and s, then the only tail calls are p->q,
q->r, r->s and s->p. If you add a tail call p->r, then it is not linear.
This kind of linearity test is for the inlining we discussed at the top.
We're still talking about the proposal to use a loop and switch or gotos?

p_label:
// p contains two tail calls; the only way this can happen is if
// they're on different branches. I suppose it would only matter if we
// had to place the goto after the body of the inlined procedure.
if (...) {
...
goto q_label;
} else {
...
goto r_label;
}
q_label:
...
goto r_label;
r_label:
...
goto s_label;
s_label:
...
goto p_label;

end_label:
...

Thanks.
--
Paul Bone
http://paul.bone.id.au
Zoltan Somogyi
2017-02-15 05:46:35 UTC
Permalink
Post by Paul Bone
Post by Zoltan Somogyi
Therefore *accurate* computation of the code duplication you end up with
(as opposed to what you *start* with) is possible only if inlining runs
all the later passes of the compiler on the output of each inlining trial.
Right, that sounds like an interesting problem. But one I'm not going to
solve any time soon.
I don't think anyone will solve it anytime soon. The space of possible algorithms
is very large, and the possible payoff is quite small relative to the amount
of work needed to explore that space.
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
I was thinking of something that is the super-set of each PiI1..In set.
The names we generate for MLDS variables include the variable number
of the HLDS variable they represent. Since these will coincide between
procedures only by accident, this is essentially equivalent. However,
just because there is an MLDS variable named ABC_42 in both p and q
does NOT mean that they are bound to even have the same type,
much less any kind of data flow relationship.
Whether two variables can share a field should be decided by their type, not
by their name.
Sorry, I misunderstood you. I thought by "superset" you meant
using the same MLDS variable for two or more input args in different
procedures if their names and types matched.
Post by Paul Bone
void a(MR_Word a, MR_Word b, MyType *c);
void b(YourType *d, MR_Word e);
struct ParamStruct1 {
MR_Word a_and_e;
MR_Word b;
MyType *c;
YourType *d;
};
It's only useful if the foreign language doesn't have unions.
I will pretend that your function "a" is named p, and your function "b" is named q,
to make the following understandable. (Using overlapping name spaces for different
kinds of things in the same example is not a good idea.)

Using the same field for arg a in proc p and arg e in proc q may save one word
in ParamStruct, but what if the tail call(s) from p to q pass the value of not a, but b
as the second arg of q?

You see, what I am *most* trying to minimize is not the size of ParamStruct,
but the number of assignments to its fields that a tail recursive call needs to make.
In the example above, if p passes b as the second arg of q, you want to store *b*
and e in the same slot, not *a* and e, since this makes the assignment a noop.

Now, if none of p's input args, or their updated versions, are passed as the second
arg of q, then making e share a slot with any input arg of p that is of the correct type
is a good idea. But if there are several input args of p with the correct type, you
do *not* want to choose among them randomly.
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
One thought about inlining the TSCC members, rather than calling them in
each switch arm, is that if they are outlined we can let the C compiler
decide if inlining is worthwhile. It may have more information than we do
to make such decisions.
It *may* have more information than we do about whether inlining is
worthwhile, though I doubt it. I am certain, though, that *we* have
more information about whether inlining is *safe*. Since the target language
compilers will always need to be conservative about safety, I don't think
leaving the inlining up to them is a good idea.
I was thinking that it would have information about the relative cost of a
C call versus placing code inline.
Except for static variables in C, when is inlining unsafe? I'm now asking
out of curiosity; I think both approaches would work almost as well as
each other.
I was thinking about the safety of deleting the code of a C function
if all calls to that function have been inlined. To do this, the C compiler
must be sure that no one has taken the address of the function anywhere,
which would require an analysis of the whole namespace in which the
function name is visible.
Post by Paul Bone
Post by Zoltan Somogyi
Obviously, this test can be made more sophisticated, and independent of
variable names, but this simple test will get us a large fraction, probably
well over half, of the improvement that is potentially available from
this kind of optimization.
Okay.
I'd like to handle loop invariants explicitly (re the loop that drives the
recursive calls) but starting here makes sense.
Then we agree; looking for other loop invariants was what I meant
by "more sophisticated tests".
Post by Paul Bone
I agree, in the long term this and most of the other work I've done in my
branch is not useful :-(.
That is why I brought it up in this thread :-(
Post by Paul Bone
However it is useful in the short term. I have
warnings for mutual recursion now 95% working (the UI needs fixing) and I
can use this to survey the Prince and other codebases to find out where the
mutual recursions are and what type they are. We can use this to determine
what part of this to start working on first, e.g. HLDS inlining or HLDS->MLDS
inlining into a loop.
Good.
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
They have to be linear, there can only be one tail call on any branch.
Other SCC calls make it non-linear, but they're not TSCC calls. Maybe I
misunderstood what you mean by linear.
I meant that each procedure in the SCC contains just one tail call,
to some member of the SCC other than itself. For example,
if an SCC contains p, q, r and s, then the only tail calls are p->q,
q->r, r->s and s->p. If you add a tail call p->r, then it is not linear.
This kind of linearity test is for the inlining we discussed at the top.
We're still talking about the proposal to use a loop and switch or gotos?
Neither; as I mentioned, it is for inlining. If e.g. q is an entry point,
either from higher SCCs or from non-tail recursive calls in this SCC,
then you would want to inline r at the single q->r tail call site,
inline s in the single inlined_r->s tail call site, and inline p at the
single inlined_s_in_inlined_r->p tail call site. This gets all the speed
benefit of the driver loop approach without needing ANY changes
to the code generator, at the cost of some code duplication.
This is therefore the lowest hanging fruit for making mutually recursive
SCCs that fit this pattern use constant stack space. What fraction
of mutually recursive SCCs fit this pattern is of course as yet unknown,
but maybe you can extend the tests you mentioned above to tell us.

Zoltan.
Paul Bone
2017-02-15 06:13:24 UTC
Permalink
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
Therefore *accurate* computation of the code duplication you end up with
(as opposed to what you *start* with) is possible only if inlining runs
all the later passes of the compiler on the output of each inlining trial.
Right, that sounds like an interesting problem. But one I'm not going to
solve any time soon.
I don't think anyone will solve it anytime soon. The space of possible algorithms
is very large, and the possible payoff is quite small relative to the amount
of work needed to explore that space.
Yeah, I thought so too. Didn't we talk about this at lunch with Andy King
once? He mentioned supercompilation in passing and I hadn't heard of it.
Basically you both said that it's something that searches the space of
possible optimisations (or is it something else) and then picks the best?
You both said that it only works for very small programs. That made sense
to me.
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
I was thinking of something that is the super-set of each PiI1..In set.
The names we generate for MLDS variables include the variable number
of the HLDS variable they represent. Since these will coincide between
procedures only by accident, this is essentially equivalent. However,
just because there is an MLDS variable named ABC_42 in both p and q
does NOT mean that they are bound to even have the same type,
much less any kind of data flow relationship.
Whether two variables can share a field should be decided by their type, not
by their name.
Sorry, I misunderstood you. I thought by "superset" you meant
using the same MLDS variable for two or more input args in different
procedures if their names and types matched.
Post by Paul Bone
void a(MR_Word a, MR_Word b, MyType *c);
void b(YourType *d, MR_Word e);
struct ParamStruct1 {
    MR_Word a_and_e;
    MR_Word b;
    MyType *c;
    YourType *d;
};
It's only useful if the foreign language doesn't have unions.
I will pretend that your function "a" is named p, and your function "b" is named q,
to make the following understandable. (Using overlapping name spaces for different
kinds of things in the same example is not a good idea.)
Using the same field for arg a in proc p and arg e in proc q may save one word
in ParamStruct, but what if the tail call(s) from p to q pass the value of b,
not a, as the second arg of q?
You see, what I am *most* trying to minimize is not the size of ParamStruct,
but the number of assignments to its fields that a tail recursive call needs to make.
In the example above, if p passes b as the second arg of q, you want to store *b*
and e in the same slot, not *a* and e, since this makes the assignment a noop.
Now, if none of p's input args, or their updated versions, are passed as the second
arg of q, then making e share a slot with any input arg of p that is of the correct type
is a good idea. But if there are several input args of p with the correct type, you
do *not* want to choose among them randomly.
I didn't consider that, I agree.
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
One thought about inlining the TSCC members, rather than calling them in
each switch arm: if they are outlined, we can let the C compiler
decide if inlining is worthwhile. It may have more information than we do
to make such decisions.
It *may* have more information than we do about whether inlining is
worthwhile, though I doubt it. I am certain, though, that *we* have
more information about whether inlining is *safe*. Since the target language
compilers will always need to be conservative about safety, I don't think
leaving the inlining up to them is a good idea.
I was thinking that it would have information about the relative cost of a
C call versus placing code inline.
Except for static variables in C, when is inlining unsafe? I'm now asking
out of curiosity; I think both approaches would work almost as well as
each other.
I was thinking about the safety of deleting the code of a C function
if all calls to that function have been inlined. To do this, the C compiler
must be sure that no one has taken the address of the function anywhere,
which would require an analysis of the whole namespace in which the
function name is visible.
It should know this if the function is static.
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
Obviously, this test can be made more sophisticated, and independent of
variable names, but this simple test will get us a large fraction, probably
well over half, of the improvement that is potentially available from
this kind of optimization.
Okay.
I'd like to handle loop invariants explicitly (re the loop that drives the
recursive calls) but starting here makes sense.
Then we agree; looking for other loop invariants was what I meant
by "more sophisticated tests".
I thought so.
Post by Zoltan Somogyi
Post by Paul Bone
I agree, in the long term this and most of the other work I've done in my
branch is not useful :-(.
That is why I brought it up in this thread :-(
Thank you.
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
They have to be linear, there can only be one tail call on any branch.
Other SCC calls make it non-linear, but they're not TSCC calls. Maybe I
misunderstood what you mean by linear.
I meant that each procedure in the SCC contains just one tail call,
to some member of the SCC other than itself. For example,
if an SCC contains p, q, r and s, then the only tail calls are p->q,
q->r, r->s and s->p. If you add a tail call p->r, then it is not linear.
This kind of linearity test is for the inlining we discussed at the top.
We're still talking about the proposal to use a loop and switch or gotos?
Neither; as I mentioned, it is for inlining. If e.g. q is an entry point,
either from higher SCCs or from non-tail recursive calls in this SCC,
then you would want to inline r at the single q->r tail call site,
inline s in the single inlined_r->s tail call site, and inline p at the
single inlined_s_in_inlined_r->p tail call site. This gets all the speed
benefit of the driver loop approach without needing ANY changes
to the code generator, at the cost of some code duplication.
This is therefore the lowest hanging fruit for making mutually recursive
SCCs that fit this pattern use constant stack space. What fraction
of mutually recursive SCCs fit this pattern is of course as yet unknown,
but maybe you can extend the tests you mentioned above to tell us.
I might be able to automate it, but it's probably equally easy to check each
SCC by hand and make a tally.

I have some similar code in the deep profiler that might be able to help.
The benefit of that is that it can weight each by the number of calls made
into and within the SCC.
--
Paul Bone
http://paul.bone.id.au
Zoltan Somogyi
2017-02-15 06:36:09 UTC
Permalink
Post by Paul Bone
Yeah, I thought so too. Didn't we talk about this at lunch with Andy King
once?
I don't remember.
Post by Paul Bone
He mentioned supercompilation in passing and I hadn't heard of it.
Basically you both said that it's something that searches the space of
possible optimisations (or is it something else) and then picks the best?
You both said that it only works for very small programs. That made sense
to me.
The basic idea is: for each N, there are only a finite number of machine language
programs consisting of N instructions. To supercompile a function in e.g. C, do this:

for N = 0 up to some limit:
    for every possible machine language program with exactly N instructions,
            with some exceptions:
        try to prove that its semantics is exactly the same as the C function
            you are trying to compile
        if the proof attempt is successful:
            exit

The "some exceptions" part meant that for instructions that contain immediate values,
the supercompiler will typically explore only a fixed small set of immediates, such as
0, 1, -1, 2, -2, up to (say) 16, instead of all 2^16 possible values in a 16-bit field.
Even then, this is feasible only for VERY small values of N.
Post by Paul Bone
Post by Zoltan Somogyi
I was thinking about the safety of deleting the code of a C function
if all calls to that functions have been inlined. To do this, the C compiler
must be sure that noone has taken the address of the function anywhere,
which would require an analysis of the whole namespace in which the
function name is visible.
It should know this if the function is static.
Yes.
Post by Paul Bone
I might be able to automate it, but it's probably equally easy to check each
SCC by hand and make a tally.
Yes, if the number of SCCs is small enough.
Post by Paul Bone
I have some similar code in the deep profiler that might be able to help.
The benefit of that is that it can weight each by the number of calls made
into and within the SCC.
I don't think we are looking for weighted information.

Zoltan.
Paul Bone
2017-02-15 11:42:59 UTC
Permalink
Post by Zoltan Somogyi
Post by Paul Bone
Yeah, I thought so too. Didn't we talk about this at lunch with Andy King
once?
I don't remember.
Post by Paul Bone
He mentioned supercompilation in passing and I hadn't heard of it.
Basically you both said that it's something that searches the space of
possible optimisations (or is it something else) and then picks the best?
You both said that it only works for very small programs. That made sense
to me.
The basic idea is: for each N, there are only a finite number of machine language
programs consisting of N instructions. To supercompile a function in e.g. C, do this:
for N = 0 up to some limit:
    for every possible machine language program with exactly N instructions,
            with some exceptions:
        try to prove that its semantics is exactly the same as the C function
            you are trying to compile
        if the proof attempt is successful:
            exit
The "some exceptions" part meant that for instructions that contain immediate values,
the supercompiler will typically explore only a fixed small set of immediates, such as
0, 1, -1, 2, -2, up to (say) 16, instead of all 2^16 possible values in a 16-bit field.
Even then, this is feasible only for VERY small values of N.
Yes, that space would be huge. The space of regular compilation decisions
for a program of the same size would be _much_ smaller. I can see that the
benefit of supercompilation is that the compiler implementers don't need to
write any optimisations; the compiler gropes in the darkness for the
optimal program.
Post by Zoltan Somogyi
Post by Paul Bone
I have some similar code in the deep profiler that might be able to help.
The benifit of that is that it can weight each by the number of calls made
into and within the SCC.
I don't think we are looking for weighted information.
I mostly agree. It depends on what question we're asking though.

At YesLogic we're concerned with the risk that Prince may crash if given
unusual input. Decreasing the probability of any crash is good, even if
we can't guarantee that we'll never run out of stack space. So knowing how
frequently an SCC is entered, and its "average depth", gives us an idea of
the typical risk of that SCC. But that doesn't help if a user gives a
strange input; that's probably a lower risk, but it does happen.

Average depth = (Calls Within SCC + Calls Into SCC) / Calls Into SCC

We also like improving performance. Prince already performs very well, but
there are some users with rather high demands. Weighted information can
definitely be used to guide performance improvements. But that's a secondary
priority.
--
Paul Bone
http://paul.bone.id.au