Add scoped event-time director models#249
Conversation
Add scoped event-time model storage and inference so the dragonfly-dally event-time surrogate can use separate ML models for switch LPs while retaining the existing director request flow. Fix ZeroMQ director request argument handling so command handlers parse the argument-count prefix exactly once. This prevents client IDs such as 1 from being mistaken for a second argument count and dropped from training/inference requests. Restore the ZeroMQ director build path by compiling director-client.C when USE_ZMQML is enabled and propagating the USE_ZMQML compile definition to downstream targets. Clean up the director-client merge conflict around global ZMQ latency statistics and keep the cumulative MPI-reduced DIR_STATS output format. Expose a latency-recording hook so event-time inference requests from dragonfly-dally are included in the shared ZMQ request statistics. Update the event-time workflow to use START_ITER and END_ITER template variables and save/load the scoped event-time model directory rather than a single model file.
This commit formats files with clang-format-20 ,which is used by the CI, instead of just clang-format.
73d52b7 to
e937968
Compare
|
@sanjaychari let me know when you've fixed conflicts and I'll review this branch. I made some improvements to how zeromq related stuff is built, so I think that's where the conflicts are. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
| extern "C" void director_record_external_zmq_latency(double processing_sec, double total_sec) | ||
| __attribute__((weak)); |
There was a problem hiding this comment.
the __attribute__((weak)) is unnecessary at this point because this should only be built when #if CODES_HAVE_ZEROMQ.
actually these 2 lines should just be completely removed because this is already declared in director-client.h. It should not be declared again in a cxx file.
| #include "codes/model-net-lp.h" | ||
| #include "codes/surrogate/init.h" | ||
| #if CODES_HAVE_TORCH | ||
| #include "codes/surrogate/director-client.h" |
There was a problem hiding this comment.
I'm guessing this is probably why you added the weak attribute below. This should either be guarded by #if CODES_HAVE_ZEROMQ or I think it could technically be unguarded.
| bash -euxo pipefail -c ' | ||
| apt-get update | ||
| apt-get install -y python3-zmq python3-numpy python3-sklearn python3-pandas python3-pip gettext-base |
There was a problem hiding this comment.
with #255 updating the ci image, this is no longer needed., same for the other places this is done. in the future we should update the docker image with dependencies like this instead of grabbing them in every job. just note that it has to be done in a separate PR because the job that creates the docker image does it on push to master only when the dockerfile changes.
| if(NOT TARGET ROSS::ROSS) | ||
| message(WARNING "ROSS package did not define ROSS::ROSS; creating compatibility imported target.") |
There was a problem hiding this comment.
Add a ci job (can be build only, I don't think it's necessary to run the test suite) that builds CODES with an old ROSS. That way we can be sure this continues to work.
Also change the warning message to a deprecation message (see elsewhere in this file where deprecation is used), because I don't think we should keep this around forever.
| } | ||
| } | ||
|
|
||
| if (director_record_external_zmq_latency) { |
There was a problem hiding this comment.
this shouldn't need to be checked. It's guarded in #if CODES_HAVE_ZEROMQ
| director_record_zmq_latency_values(zmq_processing_time, local_latency_sec); | ||
| } | ||
|
|
||
| extern "C" void director_record_external_zmq_latency(double processing_sec, double total_sec) { |
There was a problem hiding this comment.
the extern "C" is not needed here, only in the header, where you already have it.
This PR makes the following changes.
Add scoped event-time model storage and inference so the
dragonfly-dally event-time surrogate can use separate ML models for
switch LPs while retaining the existing director request flow.
Fix ZeroMQ director request argument handling so command handlers parse
the argument-count prefix exactly once. This prevents client IDs such as
1 from being mistaken for a second argument count and dropped from
training/inference requests.
Restore the ZeroMQ director build path by compiling director-client.C
when USE_ZMQML is enabled and propagating the USE_ZMQML compile
definition to downstream targets.
Clean up the director-client merge conflict around global ZMQ latency
statistics and keep the cumulative MPI-reduced DIR_STATS output format.
Expose a latency-recording hook so event-time inference requests from
dragonfly-dally are included in the shared ZMQ request statistics.
Update the event-time workflow to use START_ITER and END_ITER template
variables and save/load the scoped event-time model directory rather
than a single model file.