Epoll
For HTTPd, Epoll is a bonus and not mandatory. You should only attempt its implementation if you are already confident in your server's functionality.
This guide serves as an introduction to epoll and is not a replacement
for the epoll(7) man page or the man pages for the associated
functions.
What is Epoll?
epoll(7) is a Linux kernel API for I/O (Input/Output) event notification,
commonly used in asynchronous programming. It is a high-performance tool
for building servers that need to handle thousands of connections at the
same time.
Instead of busy waiting (constantly checking if data is ready, which
wastes CPU), epoll allows the kernel to monitor many data sources
simultaneously. It then notifies the program only when data is available on
one or more sources. This lets the program effectively switch between tasks
(like handling other connections) while waiting, making it significantly
more efficient and scalable.
Epoll C API
The Epoll C API is accessible with the following header.
#include <sys/epoll.h>
The Epoll API uses bit masking extensively. You should be familiar with this concept before continuing this guide.
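As a quick refresher on bit masking, here is a minimal sketch using the real epoll flag constants. The helper names has_flag and example_mask are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical helper: true if every bit of `flag` is set in `mask`. */
bool has_flag(uint32_t mask, uint32_t flag)
{
    return (mask & flag) == flag;
}

/* Hypothetical helper: build a mask, then clear one flag from it. */
uint32_t example_mask(void)
{
    uint32_t mask = EPOLLIN | EPOLLOUT; // Combine flags with bitwise OR
    mask &= ~(uint32_t)EPOLLOUT;        // Clear a flag with AND NOT
    return mask;                        // Only EPOLLIN is left
}
```

The same OR / AND pattern applies to the events field of struct epoll_event.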
The Epoll structures
typedef union epoll_data
{
    void *ptr;
    int fd;
    uint32_t u32;
    uint64_t u64;
} epoll_data_t;

struct epoll_event
{
    uint32_t events; /* Epoll events (EPOLLIN / EPOLLOUT / EPOLLET / ...) */
    epoll_data_t data; /* User data variable */
};
The Epoll functions
All Epoll functions return -1 on error and set errno to indicate the cause.
You should always check the return value of epoll functions.
int epoll_create1(int flags);
epoll_create1(2) creates a new epoll instance. EPOLL_CLOEXEC is the only
available flag. Check the man page for more information.
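A minimal sketch of creating an instance with its return value checked could look like the following. The wrapper name create_epoll_instance is hypothetical, and printing with perror is only one possible way to report the error.

```c
#include <stdio.h>
#include <sys/epoll.h>

/* Hypothetical helper: create an epoll instance and report failures.
 * Returns the epoll file descriptor, or -1 on error. */
int create_epoll_instance(void)
{
    // EPOLL_CLOEXEC closes the epoll fd automatically across execve(2)
    int epollfd = epoll_create1(EPOLL_CLOEXEC);
    if (epollfd == -1)
        perror("epoll_create1");
    return epollfd;
}
```

Since the result is a regular file descriptor, it should be released with close(2) once the event loop is done.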
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *_Nullable event);
- epfd: the epoll instance
- op: the operation to perform on fd
- fd: the file descriptor to update
- event: the events to monitor and the user data to associate with fd
The operations available are the following ones:
- EPOLL_CTL_ADD: add fd to the interest list
- EPOLL_CTL_MOD: update the settings associated with fd
- EPOLL_CTL_DEL: remove fd from the interest list
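The three operations can be sketched on a single file descriptor as follows. The function names interest_list_roundtrip and demo_roundtrip are hypothetical, and a pipe is used only because it is an easy pollable fd to create for a demonstration.

```c
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: walk one fd through ADD, MOD and DEL.
 * Returns 0 on success, -1 as soon as a call fails (errno is set). */
int interest_list_roundtrip(int epollfd, int fd)
{
    struct epoll_event event = { 0 };
    event.events = EPOLLIN;
    event.data.fd = fd;

    if (epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &event) == -1)
        return -1;

    event.events = EPOLLIN | EPOLLOUT; // Now also watch for writability
    if (epoll_ctl(epollfd, EPOLL_CTL_MOD, fd, &event) == -1)
        return -1;

    // The event argument is ignored for EPOLL_CTL_DEL and may be NULL
    return epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
}

/* Hypothetical demo: run the roundtrip on the read end of a pipe. */
int demo_roundtrip(void)
{
    int pipefd[2];
    int rc = -1;

    if (pipe(pipefd) == -1)
        return -1;
    int epollfd = epoll_create1(0);
    if (epollfd != -1)
    {
        rc = interest_list_roundtrip(epollfd, pipefd[0]);
        close(epollfd);
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}
```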
int epoll_wait(int epfd, struct epoll_event events[.maxevents], int maxevents,
int timeout);
epoll_wait(2) fills the events parameter with epoll_event entries from
the ready list and returns the number of entries filled.
- epfd: the epoll instance
- events: buffer that will be filled with events from the ready list
- maxevents: the maximum number of events returned
- timeout: time in milliseconds that epoll_wait(2) will block; -1 disables the timeout, so the call blocks indefinitely
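To illustrate the return value and the timeout, here is a small sketch that calls epoll_wait before and after data is written to a pipe. The function name demo_wait, the buffer size of 4 and the 100 ms timeout are arbitrary choices made for this example.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: stores the two epoll_wait results in out[].
 * Returns 0 on success, -1 if the setup failed. */
int demo_wait(int out[2])
{
    int pipefd[2];
    struct epoll_event event = { 0 };
    struct epoll_event events[4];
    int rc = -1;

    if (pipe(pipefd) == -1)
        return -1;
    int epollfd = epoll_create1(0);
    if (epollfd != -1)
    {
        event.events = EPOLLIN;
        event.data.fd = pipefd[0];
        if (epoll_ctl(epollfd, EPOLL_CTL_ADD, pipefd[0], &event) == 0)
        {
            // Nothing written yet: timeout 0 returns at once with 0 events
            out[0] = epoll_wait(epollfd, events, 4, 0);
            // After a write, the read end is reported as ready
            if (write(pipefd[1], "x", 1) == 1)
            {
                out[1] = epoll_wait(epollfd, events, 4, 100);
                rc = 0;
            }
        }
        close(epollfd);
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}
```

A return value of 0 means the timeout expired with no ready file descriptor, which is different from the -1 error case.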
Simple usage example
For readability purposes, in every example on this page, the return value of
every syscall (open, epoll_..., socket, etc.) will not be checked. Do not
forget to do it!
if (function(...) == -1)
{
    // Handle error
}
Initializing
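// Return values are not checked for readability purposes
// Note: epoll_ctl fails with EPERM on regular files, so these paths are
// assumed to be pollable files such as FIFOs or devices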
int fd4 = open("file4", O_RDWR, 0);
int fd5 = open("file5", O_RDWR, 0);
int epollfd = epoll_create1(0);
int fd7 = open("file7", O_RDWR, 0);
As you should know, at the beginning of the program, three file descriptors
(stdin, stdout and stderr) are already open. While the result of
epoll_create1 is also a fd, it is better to think of it as a container for
other fds.
Using the Epoll instance
struct epoll_event event; // Event struct used for epoll_ctl
struct epoll_event events[MAX_EVENTS]; // Event array filled by epoll_wait
// Add FD7 to the interest list
event.data.fd = fd7;
event.events = EPOLLOUT | EPOLLET; // Bitmask flags
epoll_ctl(epollfd, EPOLL_CTL_ADD, fd7, &event);
// Add FD5 to the interest list
event.data.fd = fd5;
epoll_ctl(epollfd, EPOLL_CTL_ADD, fd5, &event);
// Add STDIN to the interest list
event.data.fd = STDIN_FILENO;
event.events = EPOLLIN;
epoll_ctl(epollfd, EPOLL_CTL_ADD, STDIN_FILENO, &event);
As you can see, the epoll_event passed by pointer to epoll_ctl can be
reused, because epoll copies it internally.
/*
Blocks until a write event happens on either FD5 or FD7 or until the
STDIN fd is ready to be read. It will fill the events array with the
ready events.
*/
int nb_fds_ready = epoll_wait(epollfd, events, MAX_EVENTS, -1);
for (int i = 0; i < nb_fds_ready; i++) // Iterate over the ready list
{
    struct epoll_event ready_event = events[i];
    do_use_fd(ready_event.data.fd);
}
For example, if data is written to STDIN, its file descriptor goes into
the ready list. The next call to epoll_wait returns it along with the
events that were triggered (mostly EPOLLIN or EPOLLOUT). More information
can be found in the epoll_ctl(2) man page.
Edge triggered and Level Triggered
Epoll has two modes:
- Level Triggered
- Edge Triggered
Edge Triggered mode can be enabled per file descriptor using the EPOLLET
flag.
While the older poll(2) call can only support level-triggered mode,
epoll adds a new mode which is called edge triggered.
Level triggered is simple: epoll_wait returns a monitored file descriptor
for as long as it is ready.
For EPOLLIN, that means as long as data is available to be read; for
EPOLLOUT, as long as data can be written to the file descriptor.
Edge triggered is different: epoll_wait returns a monitored file descriptor
only when its state changes, and does not report it again while it merely
stays ready.
For EPOLLIN, that means when new data arrives; for EPOLLOUT, when the file descriptor goes from a non-writable to a writable state.
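The difference between the two modes can be observed with a small experiment: write one byte to a pipe, then call epoll_wait twice without reading it. This is only a sketch; the function name second_wait_result is hypothetical and a pipe stands in for a client socket.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns the result of the SECOND epoll_wait call,
 * or -1 if the setup failed. `extra_flags` is 0 (level triggered) or
 * EPOLLET (edge triggered). */
int second_wait_result(uint32_t extra_flags)
{
    int pipefd[2];
    struct epoll_event event = { 0 };
    struct epoll_event ready;
    int result = -1;

    if (pipe(pipefd) == -1)
        return -1;
    int epollfd = epoll_create1(0);
    if (epollfd != -1)
    {
        event.events = EPOLLIN | extra_flags;
        event.data.fd = pipefd[0];
        if (epoll_ctl(epollfd, EPOLL_CTL_ADD, pipefd[0], &event) == 0
            && write(pipefd[1], "x", 1) == 1
            && epoll_wait(epollfd, &ready, 1, 0) == 1)
        {
            // The byte is still unread: level triggered reports it again,
            // edge triggered stays silent until new data arrives
            result = epoll_wait(epollfd, &ready, 1, 0);
        }
        close(epollfd);
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```

With extra_flags set to 0 the second call returns 1, and with EPOLLET it returns 0, which is why edge-triggered code has to consume all available data after each notification.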
You should be careful when using EPOLLIN | EPOLLOUT | EPOLLET to check what
event was triggered in your event loop.
for (int i = 0; i < nb_fds_ready; i++)
{
    struct epoll_event ready_event = events[i];
    // `events` is a bitmask: test flags with `&`, not `==`
    if (ready_event.events & EPOLLIN)
        do_use_fd_read(ready_event.data.fd);
    else if (ready_event.events & EPOLLOUT)
        do_use_fd_write(ready_event.data.fd);
    else
    {
        // Other possible events if needed
    }
}
Epoll for sockets
The following example is heavily inspired by the example from the epoll(7)
man page.
struct epoll_event event; // Event struct used for epoll_ctl
struct epoll_event events[MAX_EVENTS]; // Event array filled by epoll_wait
int server_socket;
/* Code to set up listening socket, 'server_socket',
   (socket(), bind(), listen()) omitted. */
int epollfd = epoll_create1(0);
event.events = EPOLLIN;
event.data.fd = server_socket;
epoll_ctl(epollfd, EPOLL_CTL_ADD, server_socket, &event);
while (true)
{
    int nb_fds_ready = epoll_wait(epollfd, events, MAX_EVENTS, -1);
    for (int i = 0; i < nb_fds_ready; i++) // Iterate over the ready list
    {
        struct epoll_event ready_event = events[i];
        if (ready_event.data.fd == server_socket)
        {
            // A client is trying to connect to our socket
            /* Accept the client
             * Note: `accept4` (which needs _GNU_SOURCE) allows us to
             * directly set the client socket to be nonblocking. */
            int client_socket =
                accept4(server_socket, NULL, NULL, SOCK_NONBLOCK);
            // Choose the event type to handle
            event.events = EPOLLIN | EPOLLOUT | EPOLLET;
            event.data.fd = client_socket;
            // Add the client to the interest list
            epoll_ctl(epollfd, EPOLL_CTL_ADD, client_socket, &event);
        }
        else
        {
            /* An already connected client is ready for transfer
             * What you do here will depend on what you want and what
             * epoll_event you put in the client.
             *
             * For example, if you put EPOLLIN | EPOLLOUT | EPOLLET, you need
             * to check which event was triggered before attempting any
             * reading or writing on the client socket.
             *
             * If you used EPOLLET, you need to read/write until EAGAIN.
             *
             * If you used only EPOLLIN and you now need to send data to a
             * client, you can use epoll_ctl with EPOLL_CTL_MOD to change
             * the client_socket mode.
             *
             * Do not forget to remove the file descriptor from the interest
             * list using epoll_ctl when you have finished dealing with the
             * client.
             */
        }
    }
}
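The "read until EAGAIN" rule for EPOLLET mentioned in the comments above can be sketched as follows. The names drain_fd and demo_drain are hypothetical, the 4096-byte buffer is an arbitrary choice, and read(2) is used so the demo works on a pipe; on a client socket, recv(2) would follow the same pattern.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read from a nonblocking fd until it would block.
 * Returns the number of bytes consumed, or -1 on a real error. */
ssize_t drain_fd(int fd)
{
    char buffer[4096];
    ssize_t total = 0;

    while (1)
    {
        ssize_t n = read(fd, buffer, sizeof(buffer));
        if (n > 0)
            total += n; // A real server would process `buffer` here
        else if (n == 0)
            break; // End of file: the peer closed its end
        else if (errno == EAGAIN)
            break; // Nothing left for now: wait for the next edge
        else if (errno != EINTR)
            return -1; // Real error (EINTR simply retries the read)
    }
    return total;
}

/* Hypothetical demo: put 5 bytes in a nonblocking pipe and drain them. */
ssize_t demo_drain(void)
{
    int pipefd[2];
    ssize_t drained = -1;

    if (pipe(pipefd) == -1)
        return -1;
    if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) == 0
        && write(pipefd[1], "hello", 5) == 5)
        drained = drain_fd(pipefd[0]);
    close(pipefd[0]);
    close(pipefd[1]);
    return drained;
}
```

Stopping before EAGAIN in edge-triggered mode risks leaving unread data that epoll will not report again until more data arrives.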
In these examples, we always use the fd member of the epoll_data(3)
union. If you need more information to be linked to your client
(e.g., previously sent data), do not hesitate to use the void * pointer.
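One possible way to use the pointer member is sketched below. The struct client, its fields and the three helper functions are hypothetical; they only illustrate that the pointer stored at registration time is handed back with each event.

```c
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-client state; the field names are illustrative. */
struct client
{
    int fd;
    size_t bytes_sent;
};

/* Hypothetical helper: register `fd` with a heap-allocated client attached.
 * Returns the client, or NULL on failure. */
struct client *register_client(int epollfd, int fd)
{
    struct client *client = calloc(1, sizeof(*client));
    if (client == NULL)
        return NULL;
    client->fd = fd;

    struct epoll_event event = { 0 };
    event.events = EPOLLIN;
    event.data.ptr = client; // data is a union: set ptr OR fd, not both
    if (epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &event) == -1)
    {
        free(client);
        return NULL;
    }
    return client;
}

/* In the event loop, the same pointer comes back with each event. */
struct client *client_from_event(const struct epoll_event *ready)
{
    return ready->data.ptr;
}

/* Hypothetical helper: remove the client from the interest list, then
 * release its resources. Returns the result of epoll_ctl. */
int unregister_client(int epollfd, struct client *client)
{
    int rc = epoll_ctl(epollfd, EPOLL_CTL_DEL, client->fd, NULL);
    close(client->fd);
    free(client);
    return rc;
}
```

Because the union members share the same storage, the fd has to live inside the pointed-to struct once data.ptr is used.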
You should use either recv(2) or send(2) per epoll event, not both.
Single client connection animation
Below is an animation of the server socket setup and a client connection to our socket.
Multi-client connection animation
Below is an animation of 2 events that are happening at the same time:
- A new client is trying to connect
- The first client is sending data