How long should it take to send 50,000 messages?

Published
2020-08-26 14:28
Written by

At the Progressive Technology Project, we have been struggling with getting our mass emails out more quickly for quite some time. However, since we went from 5.13 ESR to 5.21 things seemed to have gotten worse. And there is no change with the latest 5.27 ESR. After countless hours trying to sift through all the factors (of course there are way more factors then just the CiviCRM core version!), I'm turning to the community for ideas.

First, let's look at some numbers. Here's a query I wrote to try to get some benchmarks. I'm runing it on a database that sends a lot of large mailings:

MariaDB [xxxxxxxxxxx]> SELECT mailing_id,start_date, end_date, job_limit, ROUND(job_limit / (time_to_sec((TIMEDIFF(end_date, start_date))) / 60)) AS emailsPerMinute FROM civicrm_mailing_job WHERE job_type = 'child' AND job_limit > 5000 AND status = 'Complete' ORDER BY id DESC limit 8;
+------------+---------------------+---------------------+-----------+-----------------+
| mailing_id | start_date | end_date | job_limit | emailsPerMinute |
+------------+---------------------+---------------------+-----------+-----------------+
| 530 | 2020-08-19 14:42:31 | 2020-08-20 00:52:37 | 46769 | 77 |
| 531 | 2020-08-19 09:52:27 | 2020-08-19 19:26:04 | 46782 | 82 |
| 529 | 2020-08-17 16:12:30 | 2020-08-18 01:41:44 | 46825 | 82 |
| 528 | 2020-08-13 10:52:57 | 2020-08-13 16:25:16 | 46933 | 141 |
| 527 | 2020-08-12 12:42:40 | 2020-08-12 18:15:02 | 46995 | 141 |
| 526 | 2020-08-11 13:02:41 | 2020-08-11 18:33:54 | 47069 | 142 |
| 525 | 2020-08-09 11:22:46 | 2020-08-09 17:20:06 | 47046 | 132 |
| 516 | 2020-08-06 09:02:48 | 2020-08-06 14:45:38 | 47093 | 137 |
+------------+---------------------+---------------------+-----------+-----------------+
8 rows in set (0.002 sec)

As you can see, between 77 and 137 messages per minute. Really not so great. And keep in mind - these numbers show how long it took to shove them into the SMTP server, not how long they took to be delivered.

One option would be to use Mailer Job Size setting - and send multiple batches at the same time. That would definitely help, but when I look at our resource usage, it looks a bit tight on CPU:

We are running about 50 sites on this server (via docker) and we run all cron jobs every 10 minutes. As you can see, it takes quite a hit every time those cron jobs run.

Seeing this CPU usage made me curious. How has our CPU usage changed over time? Here's one server:

This happens to show the spike in CPU usage after we went from 5.13 ESR to 5.21 ESR (the drop in August is because we took this server out of production).

Next, I copied a real live site and set it up to send email to the database (let's eliminate any possible network or smtp issues). I also added a lot of debugging to the flexmailer to try to find out what is taking so long:

Aug 26 16:17:24 [debug] Using flexmail send.
Aug 26 16:17:25 [debug] 2020-08-26 16:17:25: default batcher is starting to fetch contacts from pending tasks.
Aug 26 16:17:25 [debug] 2020-08-26 16:17:25: default batcher reached 1000 contacts, starting to visit
Aug 26 16:17:25 [debug] 2020-08-26 16:17:25: fireWalkBatches compose.
Aug 26 16:17:29 [debug] 2020-08-26 16:17:29: default composer is starting to get tasks.
Aug 26 16:17:29 [debug] 2020-08-26 16:17:29: default composer has finished getting tasks and is starting to evaluate.
Aug 26 16:19:03 [debug] 2020-08-26 16:19:03: default composer has finished evaluating and is starting to get rows.
Aug 26 16:19:24 [debug] 2020-08-26 16:19:24: default composer has finished getting rows.
Aug 26 16:19:24 [debug] 2020-08-26 16:19:24: default composer is starting to get tasks.
Aug 26 16:19:27 [debug] 2020-08-26 16:19:27: fireWalkBatches finished compose, starting send.
Aug 26 16:22:10 [debug] 2020-08-26 16:22:10: fireWalkBatches finished send, starting getCompleted.
Aug 26 16:22:10 [debug] 2020-08-26 16:22:10: default batcher - all delivered.
Aug 26 16:22:11 [debug] 2020-08-26 16:22:11: default batcher reached 1000 contacts, starting to visit

In this run we get about 1000 messages sent in just under 5 minutes (about 200 messages per minute) which isn't bad but could be better. It looks like evaluate (token replacement I presume) takes about 90 seconds and then the mailer takes 2 and half minutes.

Really... no clear smoking gun here. Also, on my laptop I managed to get the evaluation and send times down to 15 seconds and a minute respectfully.

All of this leads me to the conclusion that send mass email is simply CPU bound. And, the CPU usage to run a typical site is quite minimal, but the cron jobs are just killer.

I'm curious to get feedback from others, particularly groups that are running a lot of sites on the same hardware. How do you manage CPU load?

p.s. We're running Debian Buster KVM guests which are running Debian Buster docker images which are running PHP FPM (7.3).

Comments

We encountered this problem a long time ago, and we solved it by throwing a lot of CPU power at the problem. We are currently sending about 8,000 emails/minute for large sends (>100,000 emails). But frankly, we're multi-threading like crazy and using at least 10 VMs to achieve this. Absolutely, it's a CPU intensive operation.

Thanks for the cool sql. I'm also running a collection of sites using docker, some of which send via smtp to the host and others that use an external smtp. The numbers I'm seeing seem to be consistent across the last several months at least. The highest number I've found so far is 600/minute, using 0 on the mailer batch limit with an external smtp (civismtp). The other ones using smtp to host are all limited to 1000 or less per batch and seem to only be limited by that. I've been running my civimail with a dedicated cron every two minutes (i.e. it only runs the send civimail, not the rest of the cron). So perhaps your slowdown is because the civi cron is now doing a bunch more stuff that you don't even want or need to do? Running the civi crons for a specific job for all relevant docker instances via a system cron was a little bit tricky permissions-wise, but I have a script I'd be happy to share. Or you could just check that perhaps some of the new system jobs set to "Always" could be updated to just "Hourly".

Thanks Alan for the excellent suggestions. I particularly like your approach to separating out mail cron job from other cron jobs. Before reading your post, I changed the cron job to run every 15 minutes and to sleep for 10 seconds between jobs. That really smoothed out the spikes. But not sure what impact it will have on the mail delivery.

I'm impressed by 600 per minute. I would really like to have that! Are you using flexmailer/Moscaico. I wonder if it is related to number of tokens to be replaced? Number of extensions enabled?

Yes, that 600/minute is using flexmailer and Mosaico, but probably not many tokens. I guess tokens could be very expensive, depending on the tokens in use - assuming you'd have to do multiple api calls for each item going out, then 600/minute translates to 10/second. If in fact your slowness is from this, you might be able to see it in the load on your mysql server perhaps?

We have just seen a huge drop in send speed performance since upgrading to 5.27.6 (from 5.27.0).
From 300-400 per minute, to 30-40 per minute.
We are investigating what might have caused this.

No major changes in recent versions up to 5.24, about 500/minute using smtp to localhost.

Sorry, to update my previous comment. Our slow-down was actually due to our firewall refusing connections, so unrelated!

We achieve sending of up to 20 per second on average.

We do this by splitting mailings into batches of 4000, so our settings are:
Mailing batch limit: 4000
Mailer throttle time: 5000 (=0.005s)
Mailer job size: 4000

And then we call the process_mailings job using a seperate cron call, every 3 minutes.

This means that once a large mailing has 'ramped up' we have up to ~8 jobs being processed in parallel, and our send rate peaks at the limit set by our SMTP provider of 50 per second.

Smaller mailings (<4000) actually send slower, because they never get chunked to send in parallel.

The trick is finding the right balance of settings that work for your sending volumes!

Note that although we set the mailer cron job limit to 6, this setting doesn't actually do anything below MySQL 5.7.5 (see https://issues.civicrm.org/jira/browse/CRM-18011)

So to answer your question, we reckon 50,000 emails should go in about 45 minutes ;]