[numba.md] Update np.random → Generator API#550
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Many thanks @Chihiro2000GitHub! It looks good to me. Just one small comment!
| n = 1_000_000 | ||
| rng = np.random.default_rng() | ||
| u_draws = rng.uniform(size=n) | ||
| v_draws = rng.uniform(size=n) | ||
|
|
||
| @jit(parallel=True) | ||
| def calculate_pi(n=1_000_000): | ||
| def calculate_pi(u_draws, v_draws): | ||
| n = len(u_draws) | ||
| count = 0 | ||
| for i in prange(n): | ||
| u, v = np.random.uniform(0, 1), np.random.uniform(0, 1) | ||
| u, v = u_draws[i], v_draws[i] | ||
| d = np.sqrt((u - 0.5)**2 + (v - 0.5)**2) | ||
| if d < 0.5: | ||
| count += 1 |
There was a problem hiding this comment.
I think this avoids the thread-safety issue, but the timing comparison is no longer quite apples-to-apples because the random draws have been moved outside the timed function.
Personally, I think we can generate the draws in the first solution by moving
n = 1_000_000
rng = np.random.default_rng()
u_draws = rng.uniform(size=n)
v_draws = rng.uniform(size=n)to line 457, and modifying the function in the code block starting at line 457 (solution to exercise 1) to take u_draws and v_draws, like the function here.
In this case, we would have a fair timing comparison.
Please let me know what you think!
There was a problem hiding this comment.
Hi @mmcky, I think this is an example where we need to make a tough choice between keeping the legacy API for simplicity and using the new rng API with some trade-offs.
There was a problem hiding this comment.
Thanks @HumphreyYang I agree with you. I have added some additional notes below regarding double checking nopython mode as well.
|
Thanks @Chihiro2000GitHub for the careful classification and write-up. As @HumphreyYang says most of the PR looks good. A few things to resolve before merging: 1. Generator passed into
|
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Thanks for the detailed review, @mmcky and @HumphreyYang! Here is a summary of the changes made and a timing verification. Changes to
|
| Speedup | |
|---|---|
Old API (np.random.uniform inside loop), @jit(nopython=True) vs pure |
~150x |
New API (pre-drawn arrays), @jit(nopython=True) vs pure |
~400–700x (variable across runs) |
Both Cases B and C compiled successfully under @jit(nopython=True) — no object-mode fallback.
Question about "~150 times"
The old API gives a speedup consistent with the existing "around 150 times" claim. However, after the refactor (pre-drawn arrays), the speedup is considerably larger, so the current text no longer holds. Would you like me to update that line? Options I see:
- Replace with "significantly faster"
- Replace with "hundreds of times faster"
- Something else you prefer
You can also reproduce these results by running the code above.
One more question: numba_ex3 timing claim
The solution to numba_ex3 states: "On our workstation, we find that parallelization increases execution speed by a factor of 2 or 3." Since this PR also modified the numba_ex3 solution to use pre-drawn arrays, should we re-verify this figure under the new code as well?
HumphreyYang
left a comment
There was a problem hiding this comment.
Many thanks @Chihiro2000GitHub,
Please find my minor comments attached!
|
Thanks so much for taking another look, @HumphreyYang! This is very helpful. Since Matt has also been involved in this discussion, I'll wait to see if he has any further thoughts before making the next update, so that I can address both comments consistently, @mmcky, I'd be grateful for your thoughts when you have time — no rush at all. |
|
Thanks @Chihiro2000GitHub and @HumphreyYang — and agreed this is the tricky one. Here's where I land so we can converge: Non-parallel cases (B
Timing claims: yes, please update the "around 150 times" / "2 orders of magnitude" line under Everything else looks good. Thanks for the careful iteration on this one! Since there are a couple of judgment calls here where @HumphreyYang and I lean slightly differently, it might be quickest to sort these out on a short Zoom — @HumphreyYang, could we grab 15 minutes to align, then summarise the outcome back here for @Chihiro2000GitHub? |
|
Hi @mmcky, I am free to meet anytime now. Please feel free to share a link to me! |
|
Many thanks @mmcky for the meeting! It was really fun. The code fails very silently and subtly. Please take a look at the build and let me know what you think! |
Thanks @HumphreyYang -- this is really interesting! My main comment is I think we should pull n = 1_000_000
rng = np.random.default_rng()
u_draws = rng.uniform(size=n)
v_draws = rng.uniform(size=n)to be in a code block before exercise 1 starts -- to make it clear that it is re-used across the exercises. I'll have a think about the best way to simplify the 4th exercise. I think this is a REALLY valuable example. |
|
@Chihiro2000GitHub when you have time please review @HumphreyYang suggestion and let us know if this makes sense to you on a fresh read. |
|
Hi @mmcky and @HumphreyYang Thank you so much for your careful and valuable reviews. I really appreciate them. I'll go through your suggestions after I finish checking the chapter I'm currently working on in the new book. Thanks again😊 |
|
@HumphreyYang -- great discussion and nice updates re: spread to demonstrate this problem. It's a really nice example of where you can get a result that looks right but isn't when working on programs. I think a valuable lesson. I think the final question comes down to where we should have this -- here or in a @jstac — looping you in on PR #550 while it's still in review; would value your comments on the scope before we converge. It began as a mechanical np.random → Generator migration in numba.md (QuantEcon/meta#299) but has grown into a substantial pedagogical review of our current style guide. What drove the expansion. Migrating to a Generator exposed a subtle bug: sharing one Generator across a Numba prange parallel loop is a silent data race. The code runs and returns something close to π — it looks correct — but it's (1) not reproducible even with a fixed seed, and (2) far noisier, because threads clobber the generator's shared state, giving correlated/duplicated draws and a much smaller effective sample size. Humphrey confirmed this in a build; it fails quietly enough that a reader would never notice. Relationship to the style guide: This is essentially our existing "don't call rng.* inside a prange loop" warning turned into a full, worked lesson — the guide asserts the corruption; the new exercises show it. The migrated patterns (speed_ex1, speed_ex2, numba_ex3) all follow the guide; numba_ex3 is the guide's canonical "draw outside, pass arrays in" pattern. The PR also extends past the guide in two ways, which we'd fold into the style guide at merge: The guide gives only one parallel-safe option (draw outside). The PR adds a second: legacy np.random.* inside a prange loop is thread-safe under Numba (each thread gets its own state) — so "legacy = avoid" needs a carve-out for parallel loops. This is the recommended route in the memory-constrained numba_ex4. New content (the scope to weigh): two new linked exercises — numba_ex_race (the race, its two symptoms, the per-thread-state fix, and a spread plot) and numba_ex_draw_speed (pre-draw vs in-loop legacy at n=100M) — plus a discussion block after speed_ex1 on the draw-first/draw-in-loop memory tradeoff. Where we'd welcome your input: this turns a foundational π-Monte-Carlo exercise into a three-exercise treatment of parallel-RNG data races. Does that depth and placement feel right for this lecture, or would you trim or move some of it? (Heads-up, not blocking: the "~150×" / "2 orders of magnitude" / "factor of 2 or 3" timing claims still need re-measuring under the new code — @Chihiro2000GitHub to do. ) |
|
@mmcky -- Thanks for the detailed summary. I agree that the new exercises look like an interesting and potentially valuable addition. Personally, I can see the pedagogical value in including them, but the question of whether this scope is appropriate for this lecture seems beyond what I can confidently assess on my own. For that reason, I also agree that it would be best to hear @jstac’s view on the scope and placement, especially given his role as the author of the lecture. Once the final version of the code and structure is settled, I’ll do another final check, including re-measuring the “~150×” / “2 orders of magnitude” / “factor of 2 or 3” timing claims under the new code. Thanks again. |
jstac
left a comment
There was a problem hiding this comment.
Thanks for this — the data-race teaching addition is genuinely valuable, and I verified the core mechanics hold on numba 0.61.0 / numpy 2.1.3: a Generator passed into a nopython @jit works, a shared Generator under prange compiles and runs, and the same-seed results really do differ (the data race is real and reproducible).
My comments are mostly non-blocking, but two are worth a deliberate decision before merge:
- Hard-coded timing numbers ("~150×", "2 orders of magnitude", "factor of 2 or 3"). The
speed_ex1rewrite moved the RNG calls out of the loop body, so the no-JIT baseline behind "~150×" almost certainly changed — these should be re-measured and, ideally, replaced with qualitative language so they don't drift on future toolchain bumps. - Non-reproducible published output in
numba_ex_race— every CI build will render different numbers/figure. Needs a maintainer call.
Inline notes below cover the timing-comparison asymmetry, name-reuse fragility, and a few prose/formatting fixes.
|
|
||
| {ref}`speed_ex1` and {ref}`numba_ex3` both estimate $\pi$ by Monte Carlo from random samples in the unit square. | ||
|
|
||
| We generate them here and store them in `u_draws` and `v_draws` so that we can use them in both exercises and compare results |
There was a problem hiding this comment.
Pedagogical placement + nit. Generating u_draws/v_draws here, at the top of the Exercises section before speed_ex1 is even posed, partly gives away the approach the student is asked to discover. Consider introducing the shared draws closer to where they're first consumed (or framing them as "we'll reuse these in the solutions").
Also a missing period at the end of this line: "...compare results**.**"
| calculate_pi(u_draws, v_draws) | ||
| ``` | ||
|
|
||
| If we switch off JIT compilation by removing `@jit`, the code takes around |
There was a problem hiding this comment.
Hard-coded timing number is likely stale now. The old calculate_pi drew randoms inside the loop (np.random.uniform(...) calls); the new version just indexes into pre-drawn arrays. Removing the RNG calls from the loop body changes the no-JIT Python baseline substantially, so the "~150×" ratio this sentence quotes — and the "2 orders of magnitude" claim two lines down — were measured against the old code and are probably no longer accurate.
More broadly, hard-coding exact multipliers into the lecture is fragile: they drift silently on every NumPy/Numba/hardware bump. Recommend re-measuring under the new code and softening to qualitative language, e.g. "more than two orders of magnitude faster on our machine" / "dramatically faster", rather than a precise figure.
|
|
||
| ```{code-cell} ipython3 | ||
| with qe.Timer(): | ||
| calculate_pi_in_loop(rng, n) |
There was a problem hiding this comment.
Unfair timing comparison. calculate_pi(u_draws, v_draws) above is timed twice (so the 2nd run excludes JIT compilation), but calculate_pi_in_loop is timed once — so its single measurement includes compilation overhead. The prose then claims the two "run at similar speed," which this methodology can't reliably support. Add an untimed warm-up call first, or time it twice like the other.
|
|
||
| ```{code-cell} ipython3 | ||
| @jit | ||
| def calculate_pi_in_loop(rng, n): |
There was a problem hiding this comment.
Name reuse is fragile/confusing. calculate_pi_in_loop is defined here as serial @jit, then redefined as @jit(parallel=True) in numba_ex_race. Likewise calculate_pi is serial in speed_ex1 and redefined parallel in numba_ex3. A reader jumping between dropdown solutions can't tell which version is live, and numba_ex_draw_speed (below) silently depends on the redefinition order. Consider distinct names (calculate_pi_parallel, calculate_pi_in_loop_parallel, ...).
|
|
||
| This might suggest that drawing inside the loop is the better default. | ||
|
|
||
| But as we |
There was a problem hiding this comment.
Formatting. There's a stray leading space before "But as we" (renders oddly), and the surrounding sentences are split across paragraph boundaries so paragraphs begin with a lone "The" ("The first approach / must hold...", "The / second draws..."). Please reflow these into normal paragraphs without the mid-sentence hard breaks.
| (If you are executing locally, you will get different numbers, depending mainly | ||
| on the number of CPUs on your machine.) | ||
|
|
||
| Notice that we drew all of the random points *before* the loop and passed them in |
There was a problem hiding this comment.
Same hard-coded-number concern as in speed_ex1: the "a factor of 2 or 3" speed-up quoted just above (line ~726) is a precise figure baked into the text. It's heavily machine/thread-dependent and easy to leave stale. Suggest qualitative phrasing ("a modest speed-up, depending on your core count").
| plt.show() | ||
| ``` | ||
|
|
||
| Both bands are centered on $\pi$, but the band associated with the data race is much wider than the other one and narrows very slowly as the sample size grows. |
There was a problem hiding this comment.
Two things on this data-race figure:
- Non-reproducible published output. This exercise intentionally produces different numbers/figure on every execution. QuantEcon runs notebooks in CI and publishes the rendered HTML, so this cell's output and the plot will change on every build. Worth an explicit maintainer decision (@mmcky / @HumphreyYang) — accept as a deliberate exception, or keep the race in prose while stabilizing the figure.
- Wording may overstate the effect. I reproduced the race on numba 0.61 / numpy 2.1 (8 threads): same-seed results do differ ✅, but the race std was only ~1.7× the legacy std at n=1e5, and the legacy std was already near the theoretical ideal. "much wider ... narrows very slowly" may be stronger than the rendered figure shows; the magnitude is thread-count-dependent. Suggest checking the figure on the build hardware and softening if needed.
| with qe.Timer(): | ||
| u_draws = rng.uniform(size=n) | ||
| v_draws = rng.uniform(size=n) | ||
| calculate_pi(u_draws, v_draws) |
There was a problem hiding this comment.
This solution calls calculate_pi(u_draws, v_draws) expecting the parallel version from numba_ex3 — but that only holds because calculate_pi was redefined further up (see the naming note in speed_ex1). The comparison is correct only by top-to-bottom execution order. Renaming the parallel function would make this robust and self-documenting.
|
@mmcky @HumphreyYang @Chihiro2000GitHub Sorry for the long message -- obviously Claude generated -- probably best handled in another Claude pass that i'll leave to you guys. Great work clarifying these issues! I like the new exercises. We're moving towards JAX and away from Numba for various reasons --- and one is the careful and intuitive treatment of random numbers in JAX under parallelization. However, while we're still using Numba, these points are important to clarify. Much appreciated. I mentioned above that we should try to avoid specific numbers baked into text, certainly 150x and even phrases like "2 orders of magnitude". I used to do this but I'm trying to avoid it for obvious reasons. Let's try to use phrases like "large" or "substantial" and let the code outputs speak for themselves. (Worth adding to the style guide?) I'll leave this for you guys to wrap up. |
Summary
This PR migrates legacy NumPy random API usage in
numba.mdas part of QuantEcon/meta#299.Details
The Numba/JIT-related classification and changes in this PR follow the guidance at https://manual.quantecon.org/styleguide/code.html#numpy-random-number-generation. I would be grateful if you could refer to the Numba section of this page when reviewing.
Case A (
update(), main text,@jit): Left unchanged. Althoughupdate()is@jit-decorated, it is called from aprangeloop incompute_long_run_median_parallel, making it unsafe to pass a shared Generator. Flagged for reviewer judgment.Case B (speed_ex1 solution,
@jit/ no parallel):rng = np.random.default_rng()placed before the@jitdefinition; signature changed tocalculate_pi(rng, n=...);np.random.uniform→rng.uniform. Call sites updated. Note: passing a Generator into a@jitfunction may require Numba to use object mode. I checked that the updated code runs as expected on my side, but I would appreciate reviewer confirmation.Case C (speed_ex2 solution, plain Python / later JIT-compiled via
jit(compute_series)): Same pattern as Case B. Signature changed tocompute_series(n, rng);np.random.uniform→rng.uniform. Both the pure Python andjit-compiled call sites updated.Case D (numba_ex3 solution,
@jit(parallel=True)+prange): Draws lifted outside theprangeloop.rng,u_draws, andv_drawsdefined before the@jit(parallel=True)definition; signature changed tocalculate_pi(u_draws, v_draws); loop length derived fromlen(u_draws).n = 1_000_000kept to match the original example size.Case E (numba_ex4 solution,
@jit(parallel=True)+prange): Legacynp.random.randn()calls kept intentionally as a narrow memory-constrained exception. Pre-allocating(M, n)shock arrays withM = 10_000_000andn = 20would require approximately 3.2 GB, which is likely beyond what most readers' machines can comfortably accommodate. A short comment has been added inside the loop explaining this.Hi @mmcky and @HumphreyYang, I'd be grateful if you could take a look when you have time.