Discussion:
[m-dev.] I want to move and rename the dependency_graph module
Paul Bone
2017-02-06 06:06:13 UTC
Permalink
compiler/dependency_graph.m has types and routines used to build and use
a dependency graph of the HLDS. I want to add the same for the MLDS. To
avoid confusion I want to rename this module hlds.dependency_graph.m so I
also have mlds.dependency_graph.m. I also plan to extract the common code
and probably put it in either dependency_graph.m or libs.dependency_graph.m.

The current dependency_graph module is part of the transform_hlds module. I
would also like to move it to the hlds module. I think it makes more sense
there since it's utility code rather than a transformation.

So. I have:

dependency_graph.m:
Move this from the transform_hlds to the hlds parent module.

Rename it hlds.dependency_graph.m

mlds.dependency_graph.m:
A new module.

dependency_graph.m or
libs.dependency_graph.m:
The code common to the two above modules, plus some code from
hlds_module which also belongs here. Calling it dependency_graph.m is
okay, because it's generic. But it could confuse people looking for
hlds.dependency_graph.m. So I should probably call it
libs.dependency_graph.m

Does anyone see any problems with this? Particularly with moving
dependency_graph from transform_hlds to hlds?

Thanks.
--
Paul Bone
http://paul.bone.id.au
Julien Fischer
2017-02-06 11:53:44 UTC
Permalink
Hi Paul,
Post by Paul Bone
compiler/dependency_graph.m has types and routines used to build and use
a dependency graph of the HLDS. I want to add the same for the MLDS. To
avoid confusion I want to rename this module hlds.dependency_graph.m so I
also have mlds.dependency_graph.m
In keeping with the names of existing modules that define the HLDS and
MLDS respectively I would name them hlds_dependency_graph.m and
ml_dependency_graph.m.
Post by Paul Bone
I also plan to extract the common code and probably put it in either
dependency_graph.m or libs.dependency_graph.m
What code will end up in common? Would any of the common code be better
placed in library/digraph.m?
Post by Paul Bone
The current dependency_graph module is part of the transform_hlds module. I
would also like to move it to the hlds module. I think it makes more sense
there since it's utility code rather than a transformation.
There is some sense in that as it's imported by other packages in the
compiler, notably top_level and check_hlds.
Post by Paul Bone
Move this from the transform_hlds to the hlds parent module.
Rename it hlds.dependency_graph.m
A new module.
dependency_graph.m or
The code common to the two above modules, plus some code from
hlds_module which also belongs here. Calling it dependency_graph.m is
okay, because it's generic. But it could confuse people looking for
hlds.dependency_graph.m. So I should probably call it
libs.dependency_graph.m
Does anyone see any problems with this?
Other than that I don't like the module / filenames you've picked, no.

Julien.
Paul Bone
2017-02-07 00:59:03 UTC
Permalink
Post by Julien Fischer
Hi Paul,
Post by Paul Bone
compiler/dependency_graph.m has types and routines used to build and use
a dependency graph of the HLDS. I want to add the same for the MLDS. To
avoid confusion I want to rename this module hlds.dependency_graph.m so I
also have mlds.dependency_graph.m
In keeping with the names of existing modules that define the HLDS and
MLDS respectively I would name them hlds_dependency_graph.m and
ml_dependency_graph.m.
Okay.
Post by Julien Fischer
Post by Paul Bone
I also plan to extract the common code and probably put it in either
dependency_graph.m or libs.dependency_graph.m
What code will end up in common? Would any of the common code be better
placed in library/digraph.m?
Some of the types.

Perhaps some of it can go in the library. I'll check this as I go.
Post by Julien Fischer
Post by Paul Bone
The current dependency_graph module is part of the transform_hlds module. I
would also like to move it to the hlds module. I think it makes more sense
there since it's utility code rather than a transformation.
There is some sense in that as it's imported by other packages in the
compiler, notably top_level and check_hlds.
That's what I thought. There are a few cases where I've been able to
remove the dependency on the entire transform_hlds module.

Cheers.
--
Paul Bone
http://paul.bone.id.au
Zoltan Somogyi
2017-02-10 02:05:59 UTC
Permalink
Paul, I assume that you want dependency graphs for the MLDS
so that you can implement tail recursion for mutually recursive tail calls.
Is this correct?

When you first proposed that a bit more than a year ago, you didn't give us
much detail about how you planned to do it. Perhaps you could do so now.
It is always better to agree on the design approach before coding.

Are you targeting all MLDS backends, or just C? And what mechanism
do you intend to use for parameter passing, and for telling the trampoline
which member of the clique, if any, to call next? There is more than one
possible choice for both those questions; which combinations have you
tested for performance? You said then you had a script to help you explore
the issue; could you send it to us?

Zoltan.
Paul Bone
2017-02-13 06:10:20 UTC
Permalink
Post by Zoltan Somogyi
Paul, I assume that you want dependency graphs for the MLDS
so that you can implement tail recursion for mutually recursive tail calls.
Is this correct?
Yes. First I'm improving the warnings for non tail-recursive code. Then
I'll work on implementing tail recursion for mutually recursive calls.
Post by Zoltan Somogyi
When you first proposed that a bit more than a year ago, you didn't give us
much detail about how you planned to do it. Perhaps you could do so now.
It is always better to agree on the design approach before coding.
I hadn't yet planned how to do it. It's fairer to say that I was planning
how to plan to do it ;-) There are a number of things (YesLogic has had
some internal discussions) that we think are worth trying.
Post by Zoltan Somogyi
Are you targeting all MLDS backends, or just C? And what mechanism
do you intend to use for parameter passing, and for telling the trampoline
which member of the clique, if any, to call next? There is more than one
possible choice for both those questions; which combinations have you
tested for performance? You said then you had a script to help you explore
the issue; could you send it to us?
I'm aiming at high-level C in particular, but some things will work on most
MLDS backends.

My scripts were nothing fancy; they just ran and timed the programs that I
had prepared. I modified the generated code by hand to create the cases I
needed.

There are three approaches worth trying and their combinations.

+ Inlining
+ We'll do it
+ Let the C compiler do it.

Inlining
--------

Inlining can be used to remove mutual calls or reduce SCCs.

A -> B <-> C
|    |
V    V
D    E

C can probably be inlined into B to remove the mutual recursion; the call to
B within C becomes a self-recursive call. I haven't yet thought through all
the cases where this will or won't work, or is mediocre. I also want to
read the HLDS inlining code to get a sense for what can be done easily.

We'll do it
-----------

This is the optimisation that we discussed last year. So far I had
imagined creating a struct, probably on the stack of the call that enters
the SCC. The struct is passed by reference between all the members of the
SCC. The struct's fields are the arguments passed between the clique's
members (those in tail position) and their uses may overlap. The trampoline
would be the simple loop as shown in the Dr Dobbs article that cites Fergus
and yourself.

Func* fp = entry;
while (fp != NULL) {
    fp = (Func*) (*fp)();
}

However if it is to pass the struct of parameters it would look a little
different.

Func* fp = entry;
ParamStruct s;

s.param1 = param1;
s.param2 = param2;

while (fp != NULL) {
    fp = (Func*) (*fp)(&s);
}

Outputs can also be retrieved from this struct.

I imagined using this type of trampoline and struct because it seems to be
the most straightforward, and therefore the most likely to succeed. So far my
testing was to establish that handling mutual recursion would provide some
benefit, not how best to handle it, so I didn't test any other methods.
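To make the struct-plus-trampoline shape concrete, here is a minimal,
self-contained C sketch. The is_even/is_odd clique, the ParamStruct
fields, and all the function names are invented for illustration; none of
this is taken from the Mercury compiler's actual generated code.

```c
#include <stddef.h>

/* Hypothetical two-procedure clique (is_even / is_odd) rewritten in the
 * trampoline style sketched above. The ParamStruct holds the clique's
 * input argument and its output; each member does one step of work and
 * returns the next member to call, or NULL when the result is ready. */
typedef struct {
    int n;       /* the (shared) input argument */
    int result;  /* the output, retrieved from the struct afterwards */
} ParamStruct;

typedef void *(*Func)(ParamStruct *);

static void *is_odd_step(ParamStruct *s);

static void *is_even_step(ParamStruct *s)
{
    if (s->n == 0) {
        s->result = 1;              /* done: n is even */
        return NULL;
    }
    s->n -= 1;                      /* "tail call" is_odd(n - 1) */
    return (void *) is_odd_step;
}

static void *is_odd_step(ParamStruct *s)
{
    if (s->n == 0) {
        s->result = 0;              /* done: n is not odd */
        return NULL;
    }
    s->n -= 1;                      /* "tail call" is_even(n - 1) */
    return (void *) is_even_step;
}

/* The trampoline itself: constant stack space however deep the
 * mutual recursion would otherwise have been. */
int is_even(int n)
{
    ParamStruct s;
    Func fp = is_even_step;

    s.n = n;
    while (fp != NULL) {
        fp = (Func) (*fp)(&s);
    }
    return s.result;
}
```

Note that casting between function and object pointers, as in the sketch
above, is not strictly conforming ISO C (though it works on the usual
platforms); a second function-pointer typedef would avoid the cast.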

One thing that occurs to me now is to pass the arguments normally. We would
have to pass _all_ the arguments of the members to _each_ of the members.
This would allow the C compiler to decide where to place them (such as in
registers). Return parameters would also have to be handled. (Hrm, even in
normal code we could optimize in/out pairs by letting them share a single
C parameter.)

Another possibility for the trampoline is to return a token (a member of an
enum) and switch on it to decide which function to execute next. We can
pass a more precise parameter list in this case. This gives the C compiler
more control of parameter passing, at the cost of a little more indirection
in the trampoline.

Finally, all the members of the SCC could be placed in the same function
body, using goto statements to handle tail calls. This is likely to be
very fast, but it only works for SCCs with a single entry point and no
recursive non-tail calls that aren't also that entry point, at least without
code duplication.
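As a hypothetical illustration of that last scheme (the is_even/is_odd
pair is invented, not from the compiler), the merged single-function form
with gotos might look like this:

```c
/* Two mutually tail-recursive procedures merged into one C function,
 * with gotos standing in for the tail calls. Constant stack space,
 * but note the single entry point (is_even_entry). */
int is_even_merged(int n)
{
is_even_entry:
    if (n == 0) {
        return 1;
    }
    n = n - 1;
    goto is_odd_entry;          /* tail call is_odd(n - 1) */

is_odd_entry:
    if (n == 0) {
        return 0;
    }
    n = n - 1;
    goto is_even_entry;         /* tail call is_even(n - 1) */
}
```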

I don't know of any other solutions that are also compatible with whichever
revision of the C standard we use (I forget and can't easily find this
information). Let me know if there's an option that I've missed.

Let the C compiler do it
------------------------

Third, it seems that if things are "just right", compilers like GCC / Clang
can do the tail recursion themselves. In practice "just right" means that
the parameter lists and return values are the same. We could either use the
struct from above, or simply make the parameter lists match by adding unused
arguments as necessary. This may share code with the above idea, simply
disabling the use of a trampoline when we know the C compiler is one that
will optimise the mutual recursion itself.

It's very likely that we'll be using a C compiler that supports this, so
it's a good option even if it's not portable.

If we're able to place all the procedures in a single C function with gotos
for tailcalls then that may be better than letting the C compiler implement
the tailcalls, we'd need to test.

Thanks.
--
Paul Bone
http://paul.bone.id.au
Zoltan Somogyi
2017-02-14 05:58:05 UTC
Permalink
Post by Paul Bone
I hadn't yet planned how to do it. It's fairer to say that I was planning
how to plan to do it ;-) There are a number of things that we (YesLogic has
had some discussions) think are worth trying.
My original question was about what designs for ParamStruct you have explored,
but it seems I have gone further on this than you have. A brain dump follows,
interleaved with relevant parts of your message.

All the following assumes model_det code; for model_semi, what I present below
would need tweaks, while for model_non, the whole discussion is moot.
Post by Paul Bone
There are three approaches worth trying and their combinations.
+ Inlining
+ We'll do it
+ Let the C compiler do it.
In terms of implementation, the "inlining" approach is completely
independent of the others.

Let's say an SCC contains n procedures, P1 through Pn.
If the set of tail recursive calls in the SCC is {P1 -> P2, P2 -> P3, ...
Pn-1 -> Pn, Pn -> P1}, i.e. each calls the next one and the last one
calls the first, then inlining is clearly what we want to do. For each
Pi that is called from above the SCC, we would inline the callee at every
tail recursive call site except the one that calls Pi itself. This will give
both the Mercury compiler and later the target language compiler the
best chance to optimize the code. (Any recursive calls that are not TAIL
recursive would be left alone.)

If the number of entry points to the SCC is Ne, this will yield Ne copies
of the code of every procedure in the SCC. However, I think we can handle that.

First, if Ne is 2 or 3 or even 4, the code size increase is probably a price
most people would be willing to pay for reducing stack usage from linear
in the depth of the tail recursion to constant. Second, if Ne is so high
that the user would not want to pay the code size cost, it is trivial to
add a limit: don't do this if Ne exceeds a configurable threshold.
Third, in a nontrivial number of cases Ne will in fact be just one. This is
because programmers may split the code executed in each iteration of a loop
into more than one piece when its length, complexity and/or its indentation
level becomes too much.
Post by Paul Bone
I also want to
read the HLDS inlining code to get a sense for what can be done easily.
Inlining.m is missing other good inlining heuristics as well;
it has just the most basic ones. Don't take its existing structure
as sacrosanct.
Post by Paul Bone
We'll do it
-----------
This is the optimisation that we discussed last year.
...
The struct is passed by reference between all the members of the
SCC.
The trampoline handles only TAIL recursive calls, so that if e.g.
Pk is not called by a TAIL call in the SCC, then (a) ParamStruct will never
need to handle the parameters of Pk, and likewise (b) fp will never point
to Pk.

Let's call the set of Pi that have TAIL calls to them in the SCC the TSCC.
The rest of this email restricts its attention to the TSCC.

By definition, all the Pi in the TSCC have the same vector of output arguments
in terms of type and meaning, although the names of the variables representing
the same argument may differ between procedures. However, this is not true
for their input arguments.

Suppose the input arguments of Pi are PiI1 ... PiImi. The simplest design
for ParamStruct is something that corresponds to this C type,
where e.g. P1I1 stands for the type and the name of the first input arg
of the first proc in the TSCC.

struct {
    P1I1,
    P1I2,
    ...
    P1Im1,

    P2I1,
    P2I2,
    ...
    P2Im2,

    ...

    PnI1,
    PnI2,
    ...
    PnImn
}

(I am ignoring the outputs, but they would just be additional fields here.)

This is what you proposed, and it should work in all of our current MLDS
target languages.

The next simplest is this type:

union {
    struct {
        P1I1,
        P1I2,
        ...
        P1Im1
    },
    struct {
        P2I1,
        P2I2,
        ...
        P2Im2
    },
    ...
    struct {
        PnI1,
        PnI2,
        ...
        PnImn
    }
}

This would have smaller stack usage when targeting C, but would not work
for Java; I don't know about C#. Also, it would need extension to the MLDS
itself; it already has init_struct, but it doesn't have init_union. Also,
I don't believe the MLDS has any real support for structs on the stack,
though it definitely does have support for structs on the heap and in
read-only memory. This is because it always treats the variables it stores
in stack frames individually, not as a group.

When I thought about that, I realized that the trampoline loop does NOT
need to look like this:
Post by Paul Bone
ParamStruct s;
...
while (fp != NULL) {
    fp = (Func*) (*fp)(&s);
}
It can look like this:

// PiIj for all i and j, NOT in a struct
while (p_to_call != 0) {
    switch (p_to_call) {
    case 1:
        // the code of P1, in which each tail call (to e.g. Pk)
        // is replaced with
        //
        //     assignments to PkI1, ... PkImk;
        //     an assignment of k to p_to_call, and
        //     a continue
        p_to_call = 0;
        continue;
    case 2:
        // Likewise for the other procedures in the TSCC
        ...
    }
}
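As a concrete (and entirely hypothetical) rendering of this scheme, again
using an invented is_even/is_odd TSCC with one input argument per member:

```c
/* p_to_call selects the TSCC member to run next: 1 = is_even,
 * 2 = is_odd, 0 = done. The PiIj variables are plain locals rather
 * than struct fields, so the C compiler is free to place them in
 * registers or share their storage. */
int is_even_tscc(int n)
{
    int p_to_call = 1;  /* enter via is_even */
    int p1_n = n;       /* P1I1: is_even's input */
    int p2_n = 0;       /* P2I1: is_odd's input */
    int result = 0;     /* the shared output */

    while (p_to_call != 0) {
        switch (p_to_call) {
        case 1:
            /* body of is_even */
            if (p1_n == 0) {
                result = 1;
                p_to_call = 0;
                continue;
            }
            p2_n = p1_n - 1;    /* tail call is_odd(n - 1) */
            p_to_call = 2;
            continue;
        case 2:
            /* body of is_odd */
            if (p2_n == 0) {
                result = 0;
                p_to_call = 0;
                continue;
            }
            p1_n = p2_n - 1;    /* tail call is_even(n - 1) */
            p_to_call = 1;
            continue;
        }
    }
    return result;
}
```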

To achieve this, the MLDS backend would have to do code generation SCC by SCC.
For each SCC, it would need to know what TSCCs, if any, exist inside it.
It would do this by using the same algorithm to find SCCs as we already use,
but this time, using as edges only the calls in the SCC that are both
RECURSIVE calls and TAIL calls. This will partition the procedures in the SCC
into one or more TSCCs. (While all the procedures in the SCC are by definition
reachable from all other procedures in the SCC, they need not be reachable
via TAIL calls from all other procedures in the SCC.)

For TSCCs that contain only one procedure, we would translate that procedure
the usual way. For TSCCs that contain two or more procedures, we would want
to generate code that follows the scheme above.

For this, we would want a generalized version of the existing ml_gen_proc
in the compiler, which can be told that the procedure is part of a TSCC.
For such procedures, ml_proc_gen would need to put something into the
ml_gen_info that causes ml_call_gen to follow the template above for
tail recursive calls. It would also need to generate target language
variable names that have distinguishing prefix or suffix, so that
they won't clash with the names of target variables generated for
possibly-identically named variables in other procs in the TSCC.

The code generated for a procedure this way would be the body of one of
the switch arms above. We would need to generate the switch itself,
and everything that goes with it, for each entry point in the TSCC.
(Note that a procedure can be an entry point of a TSCC *without* being
an entry point of the SCC that contains the TSCC.) If the TSCC has just
one entry point, that is good. If it has more than one entry point,
this would mean code duplication, and the only difference between the copies
would be the code before the switch that initializes both p_to_call and
the input arguments of the procedure to call. It should be possible
to avoid this by generating a procedure whose argument list is

- all of the PiIj, and
- p_to_call.

Then for each entry point, we could just call this TSCC procedure
specifying the id of the procedure as p_to_call, the actual values
of its input arguments, and dummy values for the input args of all
the other procs in the TSCC.

However, I am not sure about how easy it would be to generate dummy
arguments that would be type correct in each target language.
I also think that duplicating the switch once, and maybe twice,
would be preferable performance-wise to having to pass the dummy arguments
in the first place.

The setting of p_to_call to a value, followed by a continue that
goes immediately to a switch on p_to_call, is effectively a goto
to the switch arm selected by the value assigned to p_to_call.
I hope that most target language compilers, including gcc and clang,
would recognize this fact, effectively yielding the code you proposed.
However, letting that target language compiler do this would let us avoid
including the concepts of labels and gotos in the MLDS. (In fact, I think
that the absence of those concepts from the MLDS is one of the important
things that differentiates the MLDS from the LLDS.)
Post by Paul Bone
Outputs can also be retrieved from this struct.
I imagined using this type of trampoline and struct because it seems to be
the most straightforward, therefore the most likely to succeed. So far my
testing was to establish that handling mutual recursion would provide some
benefit, and not how to handle it best, so I didn't test any other methods.
One thing that occurs to me now is to pass the arguments normally. We would
have to pass _all_ the arguments of the members to _each_ of the members.
As I mention above, I am not sure that would be easy.
Post by Paul Bone
This would allow the C compiler to decide where to place them (such as in
registers). Return parameters would also have to be handled. (Hrm, even in
normal code we could optimize in/out pairs by letting them share a single
C parameter.)
See below.
Post by Paul Bone
Another possibility for the trampoline is to return a token (a member of an
enum) and switch on it to decide which function to execute next. We can
pass a more precise parameter list in this case. This gives the C compiler
more control of parameter passing, at the cost of a little more indirection
in the trampoline.
Finally all the members of the SCC could be placed in the same function
body, and use goto statements to handle tail-calls. This is likely to be
very fast, but it only works for SCCs with a single entry point and no
recursive non-tail calls that aren't also that entry point, at least without
code duplication.
See above for how this should be doable at reasonable programming cost.
Post by Paul Bone
Third, it seems that if things are "just right" compilers like GCC / Clang
can do the tail recursion themselves. In practice "just right" means that
the parameter lists and return values are the same.
I think that approach would be pretty fragile. We would want to give
guarantees about which tail calls consume stack and which don't; I don't
think we can subcontract such guarantees to the target language compiler.
Post by Paul Bone
We could either use the
struct from above, or simply make the parameter lists match by adding unused
arguments as necessary.
The latter is unnecessarily slow.
Post by Paul Bone
This may share code with the above idea, simply
disabling the use of a trampoline when we know the C compiler matches one
that we know will do the mutual recursion.
That test would be a bitch to keep up to date.
Post by Paul Bone
If we're able to place all the procedures in a single C function with gotos
for tailcalls then that may be better than letting the C compiler implement
the tailcalls, we'd need to test.
If the best that you can hope for from having the C compiler optimize tail calls
is that the optimization transforms code that uses separate functions into
code that replaces the tail calls with gotos in a merged function body,
then generating separate functions cannot yield faster code; the only reason
for preferring that approach would be easier coding on our part. I think
the approach I proposed above should be easy enough to implement.

------------------------------

Another issue is that mutually recursive procedures often have one or more
arguments that are passed to them from above the SCC and which are passed down
the chain of recursive calls. In some cases, the argument is always passed
unchanged; in some other cases, it is sometimes passed unchanged, and
sometimes passed updated. In both cases, we should avoid the need for
every tail recursive call to set the value of these parameters, because
that is wasted work in the common case that the value is passed along
unchanged.

With the design above, such parameters could be stored in e.g. P1I1, P2I1, and
P3I2. The target language compiler may find out that it can store all these
variables in the same stack slot, making assignments such as P3I2 = P2I1
just before a tail call from P2 to P3 a no-operation. If some target language
compilers don't, then the Mercury compiler could itself compute which
sets of input arguments, one argument from each procedure in the TSCC, form
a single parameter in this sense. We could then get ml_gen_proc to refer to that
input argument by the shared name, and ml_call_gen to optimize its passing.

------------------------------

The above approach places completely different demands on the dependency
graph module than your approach. First, it means that we DON'T need
dependency graphs for the MLDS, so the splitting of the dependency_graph.m
module isn't needed either. (Though moving it from transform_hlds to hlds
is still a good idea.) Second, we would need *different* changes to
the module: the ability to compute TSCCs, and the ability to compute
entry points for both SCCs and TSCCs. Third, we would want to know
the structure of the tail recursive calls inside an SCC: for example,
are they linear?

This last point brings the discussion full circle, back to the start of the thread.

Zoltan.
Paul Bone
2017-02-15 02:08:47 UTC
Permalink
Post by Zoltan Somogyi
Post by Paul Bone
I hadn't yet planned how to do it. It's fairer to say that I was planning
how to plan to do it ;-) There are a number of things that we (YesLogic has
had some discussions) think are worth trying.
My original question was about what designs for ParamStruct you have explored,
but it seems I have gone further on this than you have. A brain dump follows,
interleaved with relevant parts of your message.
All the following assumes model_det code; for model_semi, what I present below
would need tweaks, while for model_non, the whole discussion is moot.
Post by Paul Bone
There are three approaches worth trying and their combinations.
+ Inlining
+ We'll do it
+ Let the C compiler do it.
In terms of implementation, the "inlining" approach is completely
independent of the others.
Let's say an SCC contains n procedures, P1 through Pn.
If the set of tail recursive calls in the SCC is {P1 -> P2, P2 -> P3, ...
Pn-1 -> Pn, Pn -> P1}, i.e. each calls the next one and the last one
calls the first, then inlining is clearly what we want to do. For each
Pi that is called from above the SCC, we would inline the callee at every
tail recursive call site except the one that calls Pi itself. This will give
both the Mercury compiler and later the target language compiler the
best chance to optimize the code. (Any recursive calls that are not TAIL
recursive would be left alone.)
If the number of entry points to the SCC is Ne, this will yield Ne copies
of the code of every procedure in the SCC. However, I think we can handle that.
We have to include any recursive calls that are left alone in the number Ne
(for this purpose). So it could be higher.
Post by Zoltan Somogyi
First, if Ne is 2 or 3 or even 4, the code size increase is probably a price
most people would be willing to pay for reducing stack usage from linear
in the depth of the tail recursion to constant. Second, if Ne is so high
that the user would not want to pay the code size cost, it is trivial to
add a limit: don't do this if Ne exceeds a configurable threshold.
Third, in a nontrivial number of cases Ne will in fact be just one. This is
because programmers may split the code executed in each iteration of a loop
into more than one piece when its length, complexity and/or its indentation
level becomes too much.
I wanted to think about in what cases inlining non-tail calls would also
help.

a(...) :-
    ...,
    b(...).

b(...) :-
    ...,
    c(...),
    ....

c(...) :-
    ...,
    a(...).

Inlining c into b in this situation won't help straight away: the call to a
in c (now in b) is still not in tail position.

Inlining a into c, and b into a, will help; it'll still use linear stack
space, but not as much as before. But it requires that b is either the entry
point of the SCC, or that the path(s) through the SCC towards b are
duplicated.

Regarding code duplication. We would also want to consider inlining in
branches:

a(x, ...) :-
    ...,
    b(...).
a(y, ...) :-
    ...,
    b(...).

b(...) :-
    ...,
    a(...).

Inlining b into a will create more code. This is also something we'd need
to factor into inlining decisions.

I would propose a cost model, except that we cannot hope to (analytically)
know the cost of code duplication vs the cost of non-tail calls. However I
think it would be useful to create and agree upon some kind of model. Even
if we can't compare code duplication and non-tail calls, we can calculate
code duplication.
Post by Zoltan Somogyi
Post by Paul Bone
I also want to
read the HLDS inlining code to get a sense for what can be done easily.
Inlining.m is missing other good inlining heuristics as well;
it has just the most basic ones. Don't take its existing structure
as sacrosanct.
Yep, only insofar as what's easy to add without too much effort.
Post by Zoltan Somogyi
Post by Paul Bone
We'll do it
-----------
This is the optimisation that we discussed last year.
...
The struct is passed by reference between all the members of the
SCC.
The trampoline handles only TAIL recursive calls, so that if e.g.
Pk is not called by a TAIL call in the SCC, then (a) ParamStruct will never
need to handle the parameters of Pk, and likewise (b) fp will never point
to Pk.
True. My bad, I wasn't being precise.
Post by Zoltan Somogyi
Let's call the set of Pi that have TAIL calls to them in the SCC the TSCC.
The rest of this email restricts its attention to the TSCC.
By definition, all the Pi in the TSCC have the same vector of output arguments
in terms of type and meaning, although the names of the variables representing
the same argument may differ between procedures. However, this is not true
for their input arguments.
Suppose the input arguments of Pi are PiI1 ... PiImi. The simplest design
for ParamStruct is something that corresponds to this C type,
where e.g. P1I1 stands for the type and the name of the first input arg
of the first proc in the TSCC.
struct {
    P1I1,
    P1I2,
    ...
    P1Im1,
    P2I1,
    P2I2,
    ...
    P2Im2,
    ...
    PnI1,
    PnI2,
    ...
    PnImn
}
(I am ignoring the outputs, but they would just be additional fields here.)
This is what you proposed, and it should work in all of our current MLDS
target languages.
I was thinking of something that is the super-set of the PiI1..PiImi sets.
Using unions (your suggestion below) would be good when it's supported.
Post by Zoltan Somogyi
union {
    struct {
        P1I1,
        P1I2,
        ...
        P1Im1
    },
    struct {
        P2I1,
        P2I2,
        ...
        P2Im2
    },
    ...
    struct {
        PnI1,
        PnI2,
        ...
        PnImn
    }
}
This would have smaller stack usage when targeting C, but would not work
for Java; I don't know about C#. Also, it would need extension to the MLDS
itself; it already has init_struct, but it doesn't have init_union. Also,
I don't believe the MLDS has any real support for structs on the stack,
though it definitely does have support for structs on the heap and in
read-only memory. This is because it always treats the variables it stores
in stack frames individually, not as a group.
When I thought about that, I realized that the trampoline loop does NOT
need to look like this:
Post by Paul Bone
ParamStruct s;
...
while (fp != NULL) {
    fp = (Func*) (*fp)(&s);
}
// PiIj for all i and j, NOT in a struct
while (p_to_call != 0) {
    switch (p_to_call) {
    case 1:
        // the code of P1, in which each tail call (to e.g. Pk)
        // is replaced with
        //
        //     assignments to PkI1, ... PkImk;
        //     an assignment of k to p_to_call, and
        //     a continue
        p_to_call = 0;
        continue;
    case 2:
        // Likewise for the other procedures in the TSCC
        ...
    }
}
One thought about inlining the TSCC members, rather than calling them in
each switch arm: if they are kept as separate functions, we can let the C
compiler decide whether inlining is worthwhile. It may have more
information than we do to make such decisions.

In either case I like this idea: after inlining, the C compiler has a lot of
freedom in how it places and accesses the variables that were the input
arguments between the calls.
Post by Zoltan Somogyi
To achieve this, the MLDS backend would have to do code generation SCC by SCC.
For each SCC, it would need to know what TSCCs, if any, exist inside it.
It would do this by using the same algorithm to find SCCs as we already use,
but this time, using as edges only the calls in the SCC that are both
RECURSIVE calls and TAIL calls. This will partition the procedures in the SCC
into one or more TSCCs. (While all the procedures in the SCC are by definition
reachable from all other procedures in the SCC, they need not be reachable
via TAIL calls from all other procedures in the SCC.)
For TSCCs that contain only one procedure, we would translate that procedure
the usual way. For TSCCs that contain two or more procedures, we would want
to generate code that follows the scheme above.
Would an MLDS->MLDS transformation be easier?
Post by Zoltan Somogyi
For this, we would want a generalized version of the existing ml_gen_proc
in the compiler, which can be told that the procedure is part of a TSCC.
For such procedures, ml_proc_gen would need to put something into the
ml_gen_info that causes ml_call_gen to follow the template above for
tail recursive calls. It would also need to generate target language
variable names that have distinguishing prefix or suffix, so that
they won't clash with the names of target variables generated for
possibly-identically named variables in other procs in the TSCC.
The code generated for a procedure this way would be the body of one of
the switch arms above. We would need to generate the switch itself,
and everything that goes with it, for each entry point in the TSCC.
(Note that a procedure can be an entry point of a TSCC *without* being
an entry point of the SCC that contains the TSCC.) If the TSCC has just
one entry point, that is good. If it has more than one entry point,
this would mean code duplication, and the only difference between the copies
would be the code before the switch that initializes both p_to_call and
the input arguments of the procedure to call. It should be possible
to avoid this by generating a procedure whose argument list is
- all of the PiIj, and
- p_to_call.
Then for each entry point, we could just call this TSCC procedure
specifying the id of the procedure as p_to_call, the actual values
of its input arguments, and dummy values for the input args of all
the other procs in the TSCC.
However, I am not sure about how easy it would be to generate dummy
arguments that would be type correct in each target language.
Java and C# both have null which can be used for objects. I think we
understand the possible primitive types (like int) and can use values such
as 0.

Can a user give a foreign_type pragma with a primitive Java/C# type? I
don't think that would interact with Mercury's polymorphism well so it's
probably not supported.
Post by Zoltan Somogyi
I also think that duplicating the switch once, and maybe twice,
would be preferable performance-wise to having to pass the dummy arguments
in the first place.
Yes, it definitely gives the C/Java/C# compiler more freedom.
Post by Zoltan Somogyi
The setting of p_to_call to a value, followed by a continue that
goes immediately to a switch on p_to_call, is effectively a goto
to the switch arm selected by the value assigned to p_to_call.
I hope that most target language compilers, including gcc and clang,
would recognize this fact, effectively yielding the code you proposed.
However, letting that target language compiler do this would let us avoid
including the concepts of labels and gotos in the MLDS. (In fact, I think
that the absence of those concepts from the MLDS is one of the important
things that differentiates the MLDS from the LLDS.)
We already have goto and labels in the MLDS. Using them rather than a
switch is probably okay. I don't know what the backend compilers will
optimize WRT switches. Using a goto to break out of the loop would also be
more direct: it allows us to use an infinite loop and avoid the branch
associated with the loop condition.

... Snip ...
Post by Zoltan Somogyi
Post by Paul Bone
Third, it seems that if things are "just right" compilers like GCC / Clang
can do the tail recursion themselves. In practice "just right" means that
the parameter lists and return values are the same.
I think that approach would be pretty fragile. We would want to give guarantees
about what tail calls consume stack and what calls don't; I don't think we can
subcontract such guarantees to the target language compiler.
Post by Paul Bone
We could either use the
struct from above, or simply make the parameter lists match by adding unused
arguments as necessary.
The latter is unnecessarily slow.
Post by Paul Bone
This may share code with the above idea, simply
disabling the use of a trampoline when we know the C compiler matches one
that we know will do the mutual recursion.
That test would be a bitch to keep up to date.
I don't think it would be that hard. However, I prefer the above solution
anyway; I think it will (usually) lead to better code/performance.

If the above solution would create too much code duplication, particularly
when we can't get constant stack space due to non-tail recursive calls,
then I think we should fall back to either the trampoline, or letting the C
compiler do it.
Post by Zoltan Somogyi
Post by Paul Bone
If we're able to place all the procedures in a single C function with gotos
for tailcalls then that may be better than letting the C compiler implement
the tailcalls; we'd need to test.
If the best that you can hope for from having the C compiler optimize tail calls
is that the optimization transforms code that uses separate functions into
code that replaces the tail calls with gotos in a merged function body,
then generating separate functions cannot yield faster code; the only reason
for preferring that approach would be easier coding on our part. I think
the approach I proposed above should be easy enough to implement.
I agree; you're probably right that that is all the C compiler attempts to
do, since most C code won't depend on LCO. If (big IF) it implemented
proper LCO then this would be different. But then it wouldn't require
the argument lists to match anyway.
Post by Zoltan Somogyi
------------------------------
Another issue is that mutually recursive procedures often have one or more
arguments that are passed to them from above the SCC and which are passed down
the chain of recursive calls. In some cases, the argument is always passed
unchanged; in some other cases, it is sometimes passed unchanged, and
sometimes passed updated. In both cases, we should avoid the need for
every tail recursive call to set the value of these parameters, because
that is wasted work in the common case that the value is passed along
unchanged.
With the design above, such parameters could be stored in e.g. P1I1, P2I1, and
P3I2. The target language compiler may find out that it can store all these
variables in the same stack slot, making assignments such as P3I2 = P2I1
just before a tail call from P2 to P3 a no-operation. If some target language
compilers don't, then the Mercury compiler could itself compute which
sets of input arguments, one from each procedure in the TSCC, form a single
parameter in this sense. We could then get ml_gen_proc to refer to that
input argument by the shared name, and ml_call_gen to optimize its passing.
Implementing this in the MLDS as a separate excess assignment elimination
wouldn't be too difficult.
Post by Zoltan Somogyi
------------------------------
The above approach places completely different demands on the dependency
graph module than your approach. First, it means that we DON'T need
dependency graphs for the MLDS, so the splitting of the dependency_graph.m
module isn't needed either. (Though moving it from transform_hlds to hlds
is still a good idea.) Second, we would need *different* changes to
the module: the ability to compute TSCCs, and the ability to compute
entry points for both SCCs and TSCCs. Third, we would want to know
the structure of the tail recursive calls inside an SCC: for example,
are they linear?
I have a version of ml_dependency_graph.m in my workspace. It can be
modified to add support for TSCCs. My big question at this point is whether
we'd prefer to do this as an MLDS->MLDS transformation.

They have to be linear; there can only be one tail call on any branch.
Other SCC calls make it non-linear, but they're not TSCC calls. Maybe I
misunderstood what you mean by linear.
Post by Zoltan Somogyi
This last point brings the discussion full circle, back to the start of the thread.
:-)
--
Paul Bone
http://paul.bone.id.au
Julien Fischer
2017-02-15 02:18:11 UTC
Permalink
Hi Paul,
Post by Paul Bone
Post by Zoltan Somogyi
Then for each entry point, we could just call this TSCC procedure
specifying the id of the procedure as p_to_call, the actual values
of its input arguments, and dummy values for the input args of all
the other procs in the TSCC.
However, I am not sure about how easy it would be to generate dummy
arguments that would be type correct in each target language.
Java and C# both have null which can be used for objects. I think we
understand the possible primitive types (like int) and can use values such
as 0.
Can a user give a foreign_type pragma with a primitive Java/C# type?
Yes, both backends allow the foreign type to be any accessible Java / C#
type.
Post by Paul Bone
I don't think that would interact with Mercury's polymorphism well so
it's probably not supported.
Both backends already support generating an appropriate initializer for
any mlds_type (and hence Mercury type), since we need to assign values
to local variables in the generated code to "get around" definite
assignment analysis.
(See mlds_to_java.get_java_type_initializer/1 and the similar
predicate in mlds_to_cs.m.)

Julien.
Zoltan Somogyi
2017-02-15 03:15:31 UTC
Permalink
Post by Paul Bone
Post by Zoltan Somogyi
Let's say an SCC contains n procedures, P1 through Pn.
If the set of tail recursive calls in the SCC is {P1 -> P2, P2 -> P3, ...
Pn-1 -> Pn, Pn -> P1}, i.e. each calls the next one and the last one
calls the first, then inlining is clearly what we want to do. For each
Pi that is called from above the SCC, we would inline the callee at every
tail recursive call site except the one that calls Pi itself. This will give
both the Mercury compiler and later the target language compiler the
best chance to optimize the code. (Any recursive calls that are not TAIL
recursive would be left alone.)
If the number of entry points to the SCC is Ne, this will yield Ne copies
of the code of every procedure in the SCC. However, I think we can handle that.
We have to include any recursive calls that are left alone in the number Ne
(for this purpose). So it could be higher.
Yes, you are right. If any member of the SCC is called from the SCC itself
in a non-tail position, that member would need an entry point, even if it is
not called from anywhere in any higher SCC.
Post by Paul Bone
I wanted to think about in what cases inlining non-tail calls would also
help.
When the callee is a recursive predicate, inlining it is rarely a good idea,
outside of special circumstances such as those discussed above.
Post by Paul Bone
Inlining b into a will create more code. This is also something we'd need
to factor into inlining decisions.
The decision you want inlining to make is not "do I inline B into A?", but
rather "do I inline B into *this call site* in A?". It is perfectly possible
to inline B into some call site(s) in A but not into others.
Post by Paul Bone
I would propose a cost model, except that we cannot hope to (analytically)
know the cost of code duplication vs the cost of non-tail calls. However I
think it would be useful to create and agree upon some kind of model. Even
if we can't compare code duplication and non-tail calls, we can calculate
code duplication.
Actually, that is effectively impossible to do accurately in inlining.m by itself.
The reason is that what you *want* to measure is NOT the immediate increase
in code size generated by inlining, but the *final*, overall increase in code size,
and *that* is affected by every optimization that runs after inlining.
That measure is affected by how the particular details of a given call site
affect the optimizations that can be applied to the inlined copy of the callee.
For example, if the callee has a switch on X, and a call site sets the variable
that becomes the input arg X to a given function symbol, later optimizations
can replace the whole switch with just the selected arm. This means that
in some cases, even inlining a procedure with a huge body can lead to an
overall *reduction* in code size, if later optimizations can delete everything
in the callee's body except e.g. a switch arm containing a single unification,
whose code is smaller than the code required to make the call.

Therefore *accurate* computation of the code duplication you end up with
(as opposed to what you *start* with) is possible only if inlining runs
all the later passes of the compiler on the output of each inlining trial.
Post by Paul Bone
I was thinking of something that is the super-set of each PiI1..In set.
The names we generate for MLDS variables include the variable number
of the HLDS variable they represent. Since these will coincide between
procedures only by accident, this is essentially equivalent. However,
just because there is an MLDS variable named ABC_42 in both p and q
does NOT mean that they are bound to even have the same type,
much less any kind of data flow relationship.
Post by Paul Bone
One thought about inlining the TSCC members, rather than calling them in
each switch arm, is that if they are outlined we can let the C compiler
decide if inlining is worthwhile. It may have more information than we do
to make such decisions.
It *may* have more information than we do about whether inlining is
worthwhile, though I doubt it. I am certain, though, that *we* have
more information about whether inlining is *safe*. Since the target language
compilers will always need to be conservative about safety, I don't think
leaving the inlining up to them is a good idea.
Post by Paul Bone
In either case I like this idea: after inlining, the C compiler has a lot of
freedom in how it places and accesses the variables that were the input
arguments between the calls.
Yes.
Post by Paul Bone
Post by Zoltan Somogyi
To achieve this, the MLDS backend would have to do code generation SCC by SCC.
For each SCC, it would need to know what TSCCs, if any, exist inside it.
It would do this by using the same algorithm to find SCCs as we already use,
but this time, using as edges only the calls in the SCC that are both
RECURSIVE calls and TAIL calls. This will partition the procedures in the SCC
into one or more TSCCs. (While all the procedures in the SCC are by definition
reachable from all other procedures in the SCC, they need not be reachable
via TAIL calls from all other procedures in the SCC.)
For TSCCs that contain only one procedure, we would translate that procedure
the usual way. For TSCCs that contain two or more procedures, we would want
to generate code that follows the scheme above.
Would an MLDS->MLDS transformation be easier?
No.

If you have a choice between generating the wrong thing and then fixing it up
or generating the right thing directly, always choose the latter. The code that
does the fixups in the first approach would always be vulnerable to changes
in the code generation itself screwing up the patterns that it looks for;
this could happen even if those changes were optimizations for unrelated
purposes.

I learned this lesson the hard way.
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
This may share code with the above idea, simply
disabling the use of a trampoline when we know the C compiler matches one
that we know will do the mutual recursion.
That test would be a bitch to keep up to date.
I don't think it would be that hard. However I prefer the above solution
anyway, I think it will (usually) lead to better code/performance.
I wasn't thinking about it being hard; I was thinking that it would be *annoying*,
given that it would have to be repeated after every new version of gcc/clang/etc.
Post by Paul Bone
Post by Zoltan Somogyi
Another issue is that mutually recursive procedures often have one or more
arguments that are passed to them from above the SCC and which are passed down
the chain of recursive calls. In some cases, the argument is always passed
unchanged; in some other cases, it is sometimes passed unchanged, and
sometimes passed updated. In both cases, we should avoid the need for
every tail recursive call to set the value of these parameters, because
that is wasted work in the common case that the value is passed along
unchanged.
With the design above, such parameters could be stored in e.g. P1I1, P2I1, and
P3I2. The target language compiler may find out that it can store all these
variables in the same stack slot, making assignments such as P3I2 = P2I1
just before a tail call from P2 to P3 a no-operation. If some target language
compilers don't, then the Mercury compiler could itself compute which
sets of input arguments, one from each procedure in the TSCC, form a single
parameter in this sense. We could then get ml_gen_proc to refer to that
input argument by the shared name, and ml_call_gen to optimize its passing.
Implementing this in the MLDS as a separate excess assignment elimination
wouldn't be too difficult.
The existing excess assignment elimination pass works only on conjunctions;
if it sees that one conjunct is X1 := X0, then it can replace X0 with X1,
or vice versa, in all the later conjuncts. This is relatively simple to do.
In the case we are talking about, the assignment Xq := Xp will appear just before
the tailcall from p to q, implemented as a branch back to the top of the driver
loop, so this won't work. You will have to extend the analysis to q's branch in
the switch, at which point the analysis needed is so different that you may as well
call it by some other name.

When a procedure body updates STATE_VARIABLE_X_0 to become
STATE_VARIABLE_X_1 and then never refers to STATE_VARIABLE_X_0 again,
I trust the target language compilers to reuse the same stack slot for both,
because keeping track of such things is easy for them in code without backward
branches. Tail calls introduce backward branches, making the job much harder,
not just for target language compilers, but for any MLDS analyses as well.

By contrast, things are much simpler at the HLDS level: you just need to match
the input args in the procedure head with those in the input arg list of the
recursive call. If e.g. input arg 2 in the head and input arg 3 in the tail call
have the same type and the same name modulo a numeric suffix, then we
can use the same MLDS var for input arg 2 of this proc and input arg 3 of the
tail call's callee proc.

Obviously, this test can be made more sophisticated, and independent of
variable names, but this simple test will get us a large fraction, probably
well over half, of the improvement that is potentially available from
this kind of optimization.
Post by Paul Bone
I have a version of ml_dependency_graph.m in my workspace. It can be
modified to add support for TSCCs. My big question at this point is whether
we'd prefer to do this as an MLDS->MLDS transformation.
I don't think so. As far as I can see, implementing tail calls in the MLDS
is best done at the HLDS to MLDS stage, not as an MLDS to MLDS transformation.
Post by Paul Bone
They have to be linear; there can only be one tail call on any branch.
Other SCC calls make it non-linear, but they're not TSCC calls. Maybe I
misunderstood what you mean by linear.
I meant that each procedure in the SCC contains just one tail call,
to some member of the SCC other than itself. For example,
if an SCC contains p, q, r and s, then the only tail calls are p->q,
q->r, r->s and s->p. If you add a tail call p->r, then it is not linear.
This kind of linearity test is for the inlining we discussed at the top.

Zoltan.
Paul Bone
2017-02-15 03:58:01 UTC
Permalink
Post by Zoltan Somogyi
Post by Paul Bone
I would propose a cost model, except that we cannot hope to (analytically)
know the cost of code duplication vs the cost of non-tail calls. However I
think it would be useful to create and agree upon some kind of model. Even
if we can't compare code duplication and non-tail calls, we can calculate
code duplication.
Actually, that is effectively impossible to do accurately in inlining.m by itself.
The reason is that what you *want* to measure is NOT the immediate increase
in code size generated by inlining, but the *final*, overall increase in code size,
and *that* is affected by every optimization that runs after inlining.
That measure is affected by how the particular details of a given call site
affect the optimizations that can be applied to the inlined copy of the callee.
For example, if the callee has a switch on X, and a call site sets the variable
that becomes the input arg X to a given function symbol, later optimizations
can replace the whole switch with just the selected arm. This means that
in some cases, even inlining a procedure with a huge body can lead to an
overall *reduction* in code size, if later optimizations can delete everything
in the callee's body except e.g. a switch arm containing a single unification,
whose code is smaller than the code required to make the call.
Therefore *accurate* computation of the code duplication you end up with
(as opposed to what you *start* with) is possible only if inlining runs
all the later passes of the compiler on the output of each inlining trial.
Right, that sounds like an interesting problem. But one I'm not going to
solve any time soon.
Post by Zoltan Somogyi
Post by Paul Bone
I was thinking of something that is the super-set of each PiI1..In set.
The names we generate for MLDS variables include the variable number
of the HLDS variable they represent. Since these will coincide between
procedures only by accident, this is essentially equivalent. However,
just because there is an MLDS variable named ABC_42 in both p and q
does NOT mean that they are bound to even have the same type,
much less any kind of data flow relationship.
Whether two variables can share a field should be decided by their type, not
by their name.

So if I have:

void a(MR_Word a, MR_Word b, MyType *c);

void b(YourType *d, MR_Word e);

Then I can have:

struct ParamStruct1 {
MR_Word a_and_e;
MR_Word b;
MyType *c;
YourType *d;
};

It's only useful if the foreign language doesn't have unions.
Post by Zoltan Somogyi
Post by Paul Bone
One thought about inlining the TSCC members, rather than calling them in
each switch arm, is that if they are outlined we can let the C compiler
decide if inlining is worthwhile. It may have more information than we do
to make such decisions.
It *may* have more information than we do about whether inlining is
worthwhile, though I doubt it. I am certain, though, that *we* have
more information about whether inlining is *safe*. Since the target language
compilers will always need to be conservative about safety, I don't think
leaving the inlining up to them is a good idea.
I was thinking that it would have information about the relative cost of a
C call versus placing code inline.

Except for static variables in C, when is inlining unsafe? I'm now asking
out of curiosity; I think both approaches would work almost as well as
each other.
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
To achieve this, the MLDS backend would have to do code generation SCC by SCC.
For each SCC, it would need to know what TSCCs, if any, exist inside it.
It would do this by using the same algorithm to find SCCs as we already use,
but this time, using as edges only the calls in the SCC that are both
RECURSIVE calls and TAIL calls. This will partition the procedures in the SCC
into one or more TSCCs. (While all the procedures in the SCC are by definition
reachable from all other procedures in the SCC, they need not be reachable
via TAIL calls from all other procedures in the SCC.)
For TSCCs that contain only one procedure, we would translate that procedure
the usual way. For TSCCs that contain two or more procedures, we would want
to generate code that follows the scheme above.
Would an MLDS->MLDS transformation be easier?
No.
If you have a choice between generating the wrong thing and then fixing it up
or generating the right thing directly, always choose the latter. The code that
does the fixups in the first approach would always be vulnerable to changes
in the code generation itself screwing up the patterns that it looks for;
this could happen even if those changes were optimizations for unrelated
purposes.
I learned this lesson the hard way.
The case of detecting mutual recursion seems simple enough. However I don't
have a good reason to choose one over the other, so I'm happy to agree with
you. It's also a good principle.
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
This may share code with the above idea, simply
disabling the use of a trampoline when we know the C compiler matches one
that we know will do the mutual recursion.
That test would be a bitch to keep up to date.
I don't think it would be that hard. However, I prefer the above solution
anyway; I think it will (usually) lead to better code/performance.
I wasn't thinking about it being hard; I was thinking that it would be *annoying*,
given that it would have to be repeated after every new version of gcc/clang/etc.
Yep. I'd like to keep it in mind anyway, it may still be useful as a
fallback option.
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
Another issue is that mutually recursive procedures often have one or more
arguments that are passed to them from above the SCC and which are passed down
the chain of recursive calls. In some cases, the argument is always passed
unchanged; in some other cases, it is sometimes passed unchanged, and
sometimes passed updated. In both cases, we should avoid the need for
every tail recursive call to set the value of these parameters, because
that is wasted work in the common case that the value is passed along
unchanged.
With the design above, such parameters could be stored in e.g. P1I1, P2I1, and
P3I2. The target language compiler may find out that it can store all these
variables in the same stack slot, making assignments such as P3I2 = P2I1
just before a tail call from P2 to P3 a no-operation. If some target language
compilers don't, then the Mercury compiler could itself compute which
sets of input arguments, one from each procedure in the TSCC, form a single
parameter in this sense. We could then get ml_gen_proc to refer to that
input argument by the shared name, and ml_call_gen to optimize its passing.
Implementing this in the MLDS as a separate excess assignment elimination
wouldn't be too difficult.
The existing excess assignment elimination pass works only on conjunctions;
if it sees that one conjunct is X1 := X0, then it can replace X0 with X1,
or vice versa, in all the later conjuncts. This is relatively simple to do.
In the case we are talking about, the assignment Xq := Xp will appear just before
the tailcall from p to q, implemented as a branch back to the top of the driver
loop, so this won't work. You will have to extend the analysis to q's branch in
the switch, at which point the analysis needed is so different that you may as well
call it by some other name.
When a procedure body updates STATE_VARIABLE_X_0 to become
STATE_VARIABLE_X_1 and then never refers to STATE_VARIABLE_X_0 again,
I trust the target language compilers to reuse the same stack slot for both,
because keeping track of such things is easy for them in code without backward
branches. Tail calls introduce backward branches, making the job much harder,
not just for target language compilers, but for any MLDS analyses as well.
By contrast, things are much simpler at the HLDS level: you just need to match
the input args in the procedure head with those in the input arg list of the
recursive call. If e.g. input arg 2 in the head and input arg 3 in the tail call
have the same type and the same name modulo a numeric suffix, then we
can use the same MLDS var for input arg 2 of this proc and input arg 3 of the
tail call's callee proc.
Obviously, this test can be made more sophisticated, and independent of
variable names, but this simple test will get us a large fraction, probably
well over half, of the improvement that is potentially available from
this kind of optimization.
Okay.

I'd like to handle loop invariants explicitly (re the loop that drives the
recursive calls) but starting here makes sense.
Post by Zoltan Somogyi
Post by Paul Bone
I have a version of ml_dependency_graph.m in my workspace. It can be
modified to add support for TSCCs. My big question at this point is whether
we'd prefer to do this as an MLDS->MLDS transformation.
I don't think so. As far as I can see, implementing tail calls in the MLDS
is best done at the HLDS to MLDS stage, not as an MLDS to MLDS transformation.
I agree, in the long term this and most of the other work I've done in my
branch is not useful :-(. However it is useful in the short term. I have
warnings for mutual recursion now 95% working (the UI needs fixing) and I
can use this to survey the Prince and other codebases to find out where the
mutual recursions are and what type they are. We can use this to determine
what part of this to start working on first, eg HLDS inlining or HLDS->MLDS
inlining into a loop.
Post by Zoltan Somogyi
Post by Paul Bone
They have to be linear; there can only be one tail call on any branch.
Other SCC calls make it non-linear, but they're not TSCC calls. Maybe I
misunderstood what you mean by linear.
I meant that each procedure in the SCC contains just one tail call,
to some member of the SCC other than itself. For example,
if an SCC contains p, q, r and s, then the only tail calls are p->q,
q->r, r->s and s->p. If you add a tail call p->r, then it is not linear.
This kind of linearity test is for the inlining we discussed at the top.
We're still talking about the proposal to use a loop and switch or gotos?

p_label:
// p contains two tail calls; the only way this can happen is if
// they're on different branches. I suppose it would only matter if we
// had to place the goto after the body of the inlined procedure.
if (...) {
...
goto q_label;
} else {
...
goto r_label;
}
q_label:
...
goto r_label;
r_label:
...
goto s_label;
s_label:
...
goto p_label;

end_label:
...

Thanks.
--
Paul Bone
http://paul.bone.id.au
Zoltan Somogyi
2017-02-15 05:46:35 UTC
Permalink
Post by Paul Bone
Post by Zoltan Somogyi
Therefore *accurate* computation of the code duplication you end up with
(as opposed to what you *start* with) is possible only if inlining runs
all the later passes of the compiler on the output of each inlining trial.
Right, that sounds like an interesting problem. But one I'm not going to
solve any time soon.
I don't think anyone will solve it anytime soon. The space of possible algorithms
is very large, and the possible payoff is quite small relative to the amount
of work needed to explore that space.
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
I was thinking of something that is the super-set of each PiI1..In set.
The names we generate for MLDS variables include the variable number
of the HLDS variable they represent. Since these will coincide between
procedures only by accident, this is essentially equivalent. However,
just because there is an MLDS variable named ABC_42 in both p and q
does NOT mean that they are bound to even have the same type,
much less any kind of data flow relationship.
Whether two variables can share a field should be decided by their type, not
by their name.
Sorry, I misunderstood you. I thought by "superset" you meant
using the same MLDS variable for two or more input args in different
procedures if their names and types matched.
Post by Paul Bone
void a(MR_Word a, MR_Word b, MyType *c);
void b(YourType *d, MR_Word e);
struct ParamStruct1 {
MR_Word a_and_e;
MR_Word b;
MyType *c;
YourType *d;
};
It's only useful if the foreign language doesn't have unions.
I will pretend that your function "a" is named p, and your function "b" is named q,
to make the following understandable. (Using overlapping name spaces for different
kinds of things in the same example is not a good idea.)

Using the same field for arg a in proc p and arg e in proc q may save one word
in ParamStruct, but what if the tail call(s) from p to q pass the value of not a, but b
as the second arg of q?

You see, what I am *most* trying to minimize is not the size of ParamStruct,
but the number of assignments to its fields that a tail recursive call needs to make.
In the example above, if p passes b as the second arg of q, you want to store *b*
and e in the same slot, not *a* and e, since this makes the assignment a noop.

Now, if none of p's input args, or their updated versions, are passed as the second
arg of q, then making e share a slot with any input arg of p that is of the correct type
is a good idea. But if there are several input args of p with the correct type, you
do *not* want to choose among them randomly.
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
One thought about inlining the TSCC members, rather than calling them in
each switch arm, is that if they are outlined we can let the C compiler
decide if inlining is worthwhile. It may have more information than we do
to make such decisions.
It *may* have more information than we do about whether inlining is
worthwhile, though I doubt it. I am certain, though, that *we* have
more information about whether inlining is *safe*. Since the target language
compilers will always need to be conservative about safety, I don't think
leaving the inlining up to them is a good idea.
I was thinking that it would have information about the relative cost of a
C call versus placing code inline.
Except for static variables in C, when is inlining unsafe? I'm now asking
out of curiosity; I think both approaches would work almost as well as
each other.
I was thinking about the safety of deleting the code of a C function
if all calls to that function have been inlined. To do this, the C compiler
must be sure that no one has taken the address of the function anywhere,
which would require an analysis of the whole namespace in which the
function name is visible.
Post by Paul Bone
Post by Zoltan Somogyi
Obviously, this test can be made more sophisticated, and independent of
variable names, but this simple test will get us a large fraction, probably
well over half, of the improvement that is potentially available from
this kind of optimization.
Okay.
I'd like to handle loop invariants explicitly (re the loop that drives the
recursive calls) but starting here makes sense.
Then we agree; looking for other loop invariants was what I meant
by "more sophisticated tests".
Post by Paul Bone
I agree, in the long term this and most of the other work I've done in my
branch is not useful :-(.
That is why I brought it up in this thread :-(
Post by Paul Bone
However it is useful in the short term. I have
warnings for mutual recursion now 95% working (the UI needs fixing) and I
can use this to survey the Prince and other codebases to find out where the
mutual recursions are and what type they are. We can use this to determine
what part of this to start working on first, e.g. HLDS inlining or HLDS->MLDS
inlining into a loop.
Good.
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
They have to be linear, there can only be one tail call on any branch.
Other SCC calls make it non-linear, but they're not TSCC calls. Maybe I
misunderstood what you mean by linear.
I meant that each procedure in the SCC contains just one tail call,
to some member of the SCC other than itself. For example,
if an SCC contains p, q, r and s, then the only tail calls are p->q,
q->r, r->s and s->p. If you add a tail call p->r, then it is not linear.
This kind of linearity test is for the inlining we discussed at the top.
We're still talking about the proposal to use a loop and switch or gotos?
Neither; as I mentioned, it is for inlining. If e.g. q is an entry point,
either from higher SCCs or from non-tail recursive calls in this SCC,
then you would want to inline r at the single q->r tail call site,
inline s in the single inlined_r->s tail call site, and inline p at the
single inlined_s_in_inlined_r->p tail call site. This gets all the speed
benefit of the driver loop approach without needing ANY changes
to the code generator, at the cost of some code duplication.
This is therefore the lowest hanging fruit for making mutually recursive
SCCs that fit this pattern use constant stack space. What fraction
of mutually recursive SCCs fit this pattern is of course as yet unknown,
but maybe you can extend the tests you mentioned above to tell us.

Zoltan.
Paul Bone
2017-02-15 06:13:24 UTC
Permalink
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
Therefore *accurate* computation of the code duplication you end up with
(as opposed to what you *start* with) is possible only if inlining runs
all the later passes of the compiler on the output of each inlining trial.
Right, that sounds like an interesting problem. But one I'm not going to
solve any time soon.
I don't think anyone will solve it anytime soon. The space of possible algorithms
is very large, and the possible payoff is quite small relative to the amount
of work needed to explore that space.
Yeah, I thought so too. Didn't we talk about this at lunch with Andy King
once? He mentioned supercompilation in passing and I hadn't heard of it.
Basically you both said that it's something that searches the space of
possible optimisations (or is it something else) and then picks the best?
You both said that it only works for very small programs. That made sense
to me.
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
I was thinking of something that is the super-set of each PiI1..In set.
The names we generate for MLDS variables include the variable number
of the HLDS variable they represent. Since these will coincide between
procedures only by accident, this is essentially equivalent. However,
just because there is an MLDS variable named ABC_42 in both p and q
does NOT mean that they are bound to even have the same type,
much less any kind of data flow relationship.
Whether two variables can share a field should be decided by their type, not
by their name.
Sorry, I misunderstood you. I thought by "superset" you meant
using the same MLDS variable for two or more input args in different
procedures if their names and types matched.
Post by Paul Bone
void a(MR_Word a, MR_Word b, MyType *c);
void b(YourType *d, MR_Word e);
struct ParamStruct1 {
    MR_Word a_and_e;
    MR_Word b;
    MyType *c;
    YourType *d;
};
It's only useful if the foreign language doesn't have unions.
I will pretend that your function "a" is named p, and your function "b" is named q,
to make the following understandable. (Using overlapping name spaces for different
kinds of things in the same example is not a good idea.)
Using the same field for arg a in proc p and arg e in proc q may save one word
in ParamStruct, but what if the tail call(s) from p to q pass the value of b,
not a, as the second arg of q?
You see, what I am *most* trying to minimize is not the size of ParamStruct,
but the number of assignments to its fields that a tail recursive call needs to make.
In the example above, if p passes b as the second arg of q, you want to store *b*
and e in the same slot, not *a* and e, since this makes the assignment a noop.
Now, if none of p's input args, or their updated versions, are passed as the second
arg of q, then making e share a slot with any input arg of p that is of the correct type
is a good idea. But if there are several input args of p with the correct type, you
do *not* want to choose among them randomly.
I didn't consider that, I agree.
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
One thought about inlining the TSCC members, rather than calling them in
each switch arm: if they are outlined, we can let the C compiler
decide if inlining is worthwhile. It may have more information than we do
to make such decisions.
It *may* have more information than we do about whether inlining is
worthwhile, though I doubt it. I am certain, though, that *we* have
more information about whether inlining is *safe*. Since the target language
compilers will always need to be conservative about safety, I don't think
leaving the inlining up to them is a good idea.
I was thinking that it would have information about the relative cost of a
C call versus placing code inline.
Except for static variables in C, when is inlining unsafe? I'm now asking
out of curiosity; I think both approaches would work almost as well as
each other.
I was thinking about the safety of deleting the code of a C function
if all calls to that function have been inlined. To do this, the C compiler
must be sure that no one has taken the address of the function anywhere,
which would require an analysis of the whole namespace in which the
function name is visible.
It should know this if the function is static.
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
Obviously, this test can be made more sophisticated, and independent of
variable names, but this simple test will get us a large fraction, probably
well over half, of the improvement that is potentially available from
this kind of optimization.
Okay.
I'd like to handle loop invariants explicitly (re the loop that drives the
recursive calls) but starting here makes sense.
Then we agree; looking for other loop invariants was what I meant
by "more sophisticated tests".
I thought so.
Post by Zoltan Somogyi
Post by Paul Bone
I agree, in the long term this and most of the other work I've done in my
branch is not useful :-(.
That is why I brought it up in this thread :-(
Thank you.
Post by Zoltan Somogyi
Post by Paul Bone
Post by Zoltan Somogyi
Post by Paul Bone
They have to be linear, there can only be one tail call on any branch.
Other SCC calls make it non-linear, but they're not TSCC calls. Maybe I
misunderstood what you mean by linear.
I meant that each procedure in the SCC contains just one tail call,
to some member of the SCC other than itself. For example,
if an SCC contains p, q, r and s, then the only tail calls are p->q,
q->r, r->s and s->p. If you add a tail call p->r, then it is not linear.
This kind of linearity test is for the inlining we discussed at the top.
We're still talking about the proposal to use a loop and switch or gotos?
Neither; as I mentioned, it is for inlining. If e.g. q is an entry point,
either from higher SCCs or from non-tail recursive calls in this SCC,
then you would want to inline r at the single q->r tail call site,
inline s in the single inlined_r->s tail call site, and inline p at the
single inlined_s_in_inlined_r->p tail call site. This gets all the speed
benefit of the driver loop approach without needing ANY changes
to the code generator, at the cost of some code duplication.
This is therefore the lowest hanging fruit for making mutually recursive
SCCs that fit this pattern use constant stack space. What fraction
of mutually recursive SCCs fit this pattern is of course as yet unknown,
but maybe you can extend the tests you mentioned above to tell us.
I might be able to automate it, but it's probably equally easy to check each
SCC by hand and make a tally.

I have some similar code in the deep profiler that might be able to help.
The benefit of that is that it can weight each by the number of calls made
into and within the SCC.
--
Paul Bone
http://paul.bone.id.au
Zoltan Somogyi
2017-02-15 06:36:09 UTC
Permalink
Post by Paul Bone
Yeah, I thought so too. Didn't we talk about this at lunch with Andy King
once?
I don't remember.
Post by Paul Bone
He mentioned supercompilation in passing and I hadn't heard of it.
Basically you both said that it's something that searches the space of
possible optimisations (or is it something else) and then picks the best?
You both said that it only works for very small programs. That made sense
to me.
The basic idea is: for each N, there are only a finite number of machine language
programs consisting of N instructions. To supercompile a function in e.g. C, do this:

for N = 0 up to some limit:
    for every possible machine language program with exactly N instructions,
            with some exceptions:
        try to prove that its semantics is exactly the same as the C function
            you are trying to compile
        if the proof attempt is successful:
            exit

The "some exceptions" part meant that for instructions that contain immediate values,
the supercompiler will typically explore only a fixed small set of immediates, such as
0, 1, -1, 2, -2, up to (say) 16, instead of all 2^16 possible values in a 16-bit field.
Even then, this is feasible only for VERY small values of N.
Post by Paul Bone
Post by Zoltan Somogyi
I was thinking about the safety of deleting the code of a C function
if all calls to that functions have been inlined. To do this, the C compiler
must be sure that noone has taken the address of the function anywhere,
which would require an analysis of the whole namespace in which the
function name is visible.
It should know this if the function is static.
Yes.
Post by Paul Bone
I might be able to automate it, but it's probably equally easy to check each
SCC by hand and make a tally.
Yes, if the number of SCCs is small enough.
Post by Paul Bone
I have some similar code in the deep profiler that might be able to help.
The benefit of that is that it can weight each by the number of calls made
into and within the SCC.
I don't think we are looking for weighted information.

Zoltan.
Paul Bone
2017-02-15 11:42:59 UTC
Permalink
Post by Zoltan Somogyi
Post by Paul Bone
Yeah, I thought so too. Didn't we talk about this at lunch with Andy King
once?
I don't remember.
Post by Paul Bone
He mentioned supercompilation in passing and I hadn't heard of it.
Basically you both said that it's something that searches the space of
possible optimisations (or is it something else) and then picks the best?
You both said that it only works for very small programs. That made sense
to me.
The basic idea is: for each N, there are only a finite number of machine language
programs consisting of N instructions. To supercompile a function in e.g. C, do this:
for N = 0 up to some limit:
    for every possible machine language program with exactly N instructions,
            with some exceptions:
        try to prove that its semantics is exactly the same as the C function
            you are trying to compile
        if the proof attempt is successful:
            exit
The "some exceptions" part meant that for instructions that contain immediate values,
the supercompiler will typically explore only a fixed small set of immediates, such as
0, 1, -1, 2, -2, up to (say) 16, instead of all 2^16 possible values in a 16-bit field.
Even then, this is feasible only for VERY small values of N.
Yes, that space would be huge. The space of regular compilation decisions
for a program of the same size would be _much_ smaller. I can see that the
benefit of supercompilation is that the compiler implementers don't need to
write any optimisations; the compiler gropes in the darkness for the
optimal program.
Post by Zoltan Somogyi
Post by Paul Bone
I have some similar code in the deep profiler that might be able to help.
The benifit of that is that it can weight each by the number of calls made
into and within the SCC.
I don't think we are looking for weighted information.
I mostly agree. It depends on what question we're asking though.

At YesLogic we're concerned with the risk that Prince may crash if given
unusual input. Decreasing the probability of any crash is good, even if
we can't guarantee that we'll never run out of stack space. So knowing how
frequently an SCC is entered, and its "average depth", gives us an idea of
the typical risk of that SCC. But that doesn't help if a user gives a
strange input; that's probably a lower risk, but it does happen.

Average depth = (Calls Within SCC + Calls Into SCC) / Calls Into SCC

We also like improving performance. Prince already performs very well, but
there are some users with rather high demands. Weighted information can
definitely be used to guide performance improvements. But that's a secondary
priority.
--
Paul Bone
http://paul.bone.id.au