In the previous post, I introduced my idea of implementing the Mandelbrot set on an FPGA and described the Mandelbrot set equation and the Escape-time coloring algorithm. This post will talk about the implementation.

## Modeling the Mandelbrot set

I consider it a good idea to start any FPGA implementation work with some planning, and in the case of an algorithm, the first thing you’d want to do is write a simple implementation of it yourself, just so you can verify that your understanding of the algorithm is correct. My implementation is a simple C program using double types, which outputs an image in the PPM format.

The implementation generally does the following:

- sets up a palette for the image
- writes the appropriate headers for the image
- walks over all pixels of the output image
- maps the current pixel coordinate into the mathematical Mandelbrot space
- calculates the Mandelbrot set equation for up to maximum iterations, or until the value limit is reached
- uses the number of iterations as an index into the RGB palette
- writes the RGB values to the output image

Running this program will produce the following image:

Seems to work just fine!

Since floating-point data types are not really appropriate for FPGAs, you need to consider how to convert the floating-point arithmetic to integer arithmetic, or more specifically, to fixed-point arithmetic. Generally, this can be quite a tiresome step, since you need to carefully consider how many integer / fractional bits to use in each step, when to cut / round bits, weighing in hardware costs, etc.

Luckily, the Mandelbrot set calculation is relatively simple and limited in range, so you can basically eyeball the correct number of bits to use. My implementation is using a fixed point format *s4.49* (*s* represents the single sign bit, the number *4* before the decimal point is the number of integer bits and the number *49* after the decimal point is the number of fractional bits).

For this algorithm, the most critical parts are before the decimal point – we need signed numbers, and we can eyeball the maximum value of the calculations to always be smaller than *16.0*, so *4* integer bits are needed. The fractional part seems arbitrary and the reasoning will be explained a little later, but this number represents the precision of our calculations, which are important when zooming into the Mandelbrot set.

Another thing to consider is the maximum number of iterations for a single point. For the moment, I decided to go with *256* maximum iterations, mostly because that number nicely fits in an *8bit* variable. But this decision is pretty much arbitrary, as the most appropriate number of iterations is hard to pinpoint. Maybe I’ll revisit this and implement some sort of dynamic approach according to the current ‘scene’ – position and zoom. For now, it should do.

## FPGA implementation preparations

This is my first time working on the Terasic DE10-nano FPGA board, so some introductory research was needed. Terasic has a lot of documentation on all of its boards available on its site which is worth at least a cursory glance. The most useful one is the board schematic and their example projects.

From the schematic, you can see that the HDMI TX IC used is the Analog Devices’ ADV7513, and that is excellent since the interface to the FPGA TX is pretty much the same as the one for the ADV7123, which is used on a lot of FPGA boards for VGA analog output. The HDMI TX also requires some configuration through the I2C bus, I decided to just reuse the code from the sample HDMI sample project for this part.

Terasic also provides software to create the Quartus project for you, which is great, because manually editing pin constraints is not a fun exercise.

Of course, you’d want to first test any new FPGA board with the vendor-provided software – I tried the Linux running on the DE10-nano:

Another important document worth a glance is the FPGA device user manual, or in this case, the

. While an FPGA device is generally pretty configurable, there are some fixed blocks inside, which we’ll use generously.One of the important ‘fixed’ blocks are the memory blocks. From the handbook, you can see that the memory blocks can be configured in *1k x 10* or *1k x 8* sizes, which will be used extensively in this project, as will be described later.

Another important block is the so-called DSP block, which is basically a multiplier with some additional logic. The most important property of the DSP block is the supported input operand width. From the handbook, you can see that a single DSP block supports up to *27×27* multiplier operands, and DSP blocks can be chained together. For better precision in the Mandelbrot calculations, I decided to use two DSP blocks for a single multiplication – this closely matches the precision of a double floating-point datatype on a PC and will provide solid Mandelbrot zooming. If you wondered where that number *49* in the selected fixed-point format (*s4.49*) above comes from, this is it – since the whole multiplier input is *54* bits, and we need to take into account the sign and integer parts, we can get a maximum of *49* fractional bits from the FPGA hardware.

## FPGA implementation

Besides the basic scaffolding for the project, you’d usually want to get the clocks, PLLs, and resets working – all three can easily be tested with the Hello World of FPGAs – a LED blinky.

### Video

After that, some real work – the video sync generator, which is basically just a couple of counters and comparators – you need to count the video columns and rows and output an active signal together with horizontal and vertical sync signals. You can see my implementation here. If you output the counters from the video sync generator module, together with some helper signals, you can already generate a video output. See my version here, and the output below:

With the video sync signals working, it’s time to decide how to handle the actual pixels. Since the Mandelbrot set implementation has a limited number of iterations as its output, the logical choice is to use a color palette. That implies two sets of memories for the final pixel output – a video index RAM, which holds the palette index (and is basically just a number of iterations for that particular pixel), and a palette memory. The index RAM requires *image_width* x *image_height* entries that are *8bits* wide and the palette memory then has *2^8* (*256*) entries and must be *24bits* (*8*+*8*+*8* for RGB) wide. You can see my implementation here.

I also wrote an implementation of a text console, since I wanted to put some textual information on the video output, like the current position & zoom, and the number of iterations and time required to render the whole screen. The text console is in a way similar to the normal pixel output, in that you have a character index RAM and a palette of sorts, in this case, a character ROM memory. What is different is how you access the pixels of the characters since you need to read the same character multiple times – *font_width* times horizontally, and *font_height* times vertically. You can see my implementation here, and the output of the console, and some color background below:

### Mandelbrot set calculation

Mandelbrot set calculation can be thought of as consisting of two parts:

- a location walker, which must go over all pixel coordinates and output the coordinates in mathematical space
- an iteration calculator, which must calculate the number of iterations for each coordinate

These two parts strongly imply a sort of producer-consumer relation – since the coordinates can be calculated quickly and easily, once you figure out your initial position (upper left corner) and the step in horizontal and vertical directions (Mandelbrot set position and zoom), but the per-pixel iterations take a number of cycles on average. I split the Mandelbrot set calculations into a mandelbrot_coords module and a mandelbrot_calc module, with a FIFO in between.

The coordinate calculation is pretty easy, it basically boils down to a couple of adders:

always @ (posedge clk, posedge rst) begin if (rst) begin cnt_x <= #1 VMINX; cnt_y <= #1 VMINY; cnt_adr <= #1 'd0; cnt_en <= #1 1'b1; man_x <= #1 MAN_DEF_X0; man_y <= #1 MAN_DEF_Y0; end else if (clk_en) begin if (init) begin cnt_x <= #1 VMINX; cnt_y <= #1 VMINY; cnt_adr <= #1 'd0; cnt_en <= #1 1'b1; man_x <= #1 MAN_DEF_X0; man_y <= #1 MAN_DEF_Y0; end else if (out_rdy && cnt_en) begin if (cnt_x == VMAXX) begin if (cnt_y == VMAXY) begin cnt_en <= #1 1'b0; end else begin cnt_y <= #1 cnt_y +'d1; cnt_adr <= #1 cnt_adr +'d1; man_y <= #1 man_y + inc_y; $display("Mandelbrot: processing line %d", man_y); end cnt_x <= #1 VMINX; man_x <= #1 MAN_DEF_X0; end else begin cnt_x <= #1 cnt_x + 'd1; cnt_adr <= #1 cnt_adr + 'd1; man_x <= #1 man_x + inc_x; end end end end

I had a simulation/implementation mismatch here, which took a while to figure out – the problem is that Quartus can only do calculations with Verilog parameters with 32bit numbers, while the Icarus Verilog simulator seems to use larger datatypes (*come on Intel, step up your game – Quartus seems really *really* dated compared to Vivado, this problem, really poor SystemVerilog support, … and this is coming from someone who always preferred Intel / Altera FPGAs …*).

For the iteration calculation, the implementation is, again, simple (deceptively so!):

assign xx_mul_comb = x*x; assign yy_mul_comb = y*y; assign xy_mul_comb = x*y; assign xx_comb = xx_mul_comb[2*FPW-1-FP_S-FP_I:FPW-FP_S-FP_I]; assign yy_comb = yy_mul_comb[2*FPW-1-FP_S-FP_I:FPW-FP_S-FP_I]; assign xy2_comb = {xy_mul_comb[2*FPW-2-FP_S-FP_I:FPW-FP_S-FP_I], 1'b0}; assign limit = {1'h0, 4'h4, {FP_F{1'h0}}}; // 4.0 assign niters_check = niters[IW+1-1:1] >= (MAXITERS-1); assign limit_check = (xx + yy) > limit; assign check = niters_check || limit_check; assign x_comb = xx - yy + x_man_r; assign y_comb = xy2 + y_man_r; always @ (posedge clk) begin if (clk_en) begin if (in_vld && in_rdy) begin adr_o <= #1 adr_i; x_man_r <= #1 x_man; y_man_r <= #1 y_man; x <= #1 'd0; y <= #1 'd0; xx <= #1 'd0; yy <= #1 'd0; xy2 <= #1 'd0; niters <= #1 'd0; end else if(!check) begin x <= #1 x_comb; y <= #1 y_comb; xx <= #1 xx_comb; yy <= #1 yy_comb; xy2 <= #1 xy2_comb; niters <= #1 niters + 'd1; end end end

I was sure I’ll need to implement some sort of a state machine to handle initialization of the variables and the final check, but, by a lucky coincidence, this code works just fine – there is no need to delay the checking of x^2 + y^2 calculation, as the result starts from 0, so initially, it will always produce correct results.

Throwing all the code together will produce the following result:

So here it is, a Mandelbrot set calculated on an FPGA, in glorious 640×480 resolution with 27bit fixed-point precision! See all of my code in my mandelbrot_fpga project on Github, tagged with v0.4.

## Next steps

With the initial version working, the project still needs some additions to make it better and more useful:

- increasing the precision to 54bits – this was omitted since Quartus doesn’t like 64bit integers in parameters, and this needs to be handled slightly differently
- moving the Mandelbrot calculations to a separate, faster clock – this was the point after all – speed! Currently, the ~25MHz video clock is used for all logic
- multiplying the mandelbrot module to use up all available DSP elements of the (pretty generous) Cyclone V device on the DE10-nano board
- better pipelining for the multipliers – a couple of registers need to be added, and a single calculation module updated to handle multiple coordinates at once
- making the logic more configurable – currently, a lot of config is handled with params, which will need to be proper inputs
- convert the project to be MiSTer-compatible
- last but certainly not least – add the ability to actually zoom and move around the Mandelbrot set – I’ll probably just add a soft-core CPU to handle this part, with inputs from keyboard/joypad

Stay tuned for the next post!