Basically, I’m sick of these network problems, and I’m sure you are too. We’ll be migrating everything (pictrs, frontends & backends, database & webservers) to 1 single server at OVH.

First it was a CPU issue, so we worked around that by moving pictrs to another server, which left us just enough CPU to keep us all okay. Everything was fine until the spammers attacked. Then we couldn’t process the activities fast enough, and now we can’t catch up.

We are having constant network dropouts/lag spikes where all the network connections get “pooled”, with a CPU steal of 15%. So we bought more vCPU and threw resources at the problem. That fixed it temporarily, but our “NVMe” VPS, which housed our database and Lemmy applications, was still showing an IOWait of 10-20% half the time. Unbeknownst to me, that was not IO related but network related.

So we moved the database off to another server, but unfortunately that caused another issue (the unintended side effects of cheap hosting?). Now we have 1 main server accepting all network traffic, which then has to contact the NVMe DB server and the pict-rs server as well, then send all that information back to the users. This was part of the network problem.
Adding backend & frontend Lemmy containers to the pict-rs server helped alleviate this, and is what you are seeing at the time of this post. Now a good 50% of the required database and web traffic is split across two servers, which keeps our servers from being completely saturated with requests.

On top of the recent nonsense, it looks like we are limited to 100Mb/s, which is roughly 12MB/s. So downloading a 20MB video via pictrs currently goes through the following flow (in this example):

  • User requests the file via Cloudflare.
  • (It’s not already cached, so it has to be requested from our servers.)
  • Cloudflare proxies the request to our server (app1).
  • Our app1 server connects to the pictrs server.
  • Our app1 server downloads the file from pictrs at a maximum of 100Mb/s.
  • At the same time, the app1 server is uploading the file via Cloudflare to you at a maximum of 100Mb/s.
  • During this time our connection is completely saturated and no other network queries can be handled.
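
To put rough numbers on the flow above, here is a minimal sketch of the arithmetic. It assumes the link really is capped at 100Mb/s and ignores protocol overhead; the function name is made up for illustration.

```python
def transfer_seconds(size_mb: float, link_mbit_per_s: float) -> float:
    """Seconds to move size_mb megabytes over a link rated in megabits per second."""
    link_mbyte_per_s = link_mbit_per_s / 8  # 8 bits per byte
    return size_mb / link_mbyte_per_s


LINK_MBIT = 100  # Mb/s, the cap described above

print(LINK_MBIT / 8)                      # 12.5 MB/s, i.e. "roughly 12MB/s"
print(transfer_seconds(20, LINK_MBIT))    # 1.6 seconds for the 20MB video
```

So under these assumptions, a single 20MB file keeps app1's link busy for around 1.6 seconds in each direction, and a handful of concurrent requests like that would add up quickly.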

This is just one example of the network issue I found out we had after moving to the multi-server system. It is, of course, not a problem when you have everything on one beefy server.


Those are the broad strokes of the problems.

Thus we are completely ripping everything out and migrating to a HUGE OVH box. I say huge in capital letters because the OVH server is $108/m and has 8 vCPU, 32GB RAM, & 160GB of NVMe. This amount of RAM allows for the whole database to fit into memory. If this doesn’t help then I’d be at a loss as to what will.
Currently (assuming we kept paying for the standalone postgres server) our monthly costs would have been around $90/m: $60/m (main) + $9/m (pictrs) + $22/m (db).

Migration plan:

The biggest downtime will be the database migration, as we need to take it offline to ensure consistency. Which is just simpler than

DB:

  • stop everything
  • start postgres
  • take a backup (20-25 mins)
  • send that backup to the new server (5-6 mins, limited to 12MB/s)
  • restore (10-15 mins)

pictrs:

  • syncing the file store across to the new server

app(s):

  • regular deployment

This is the same process I recently did here, so I have the steps already cemented in my brain. As you can see, taking a backup ends up taking longer than restoring. That’s because, when testing the restore process on our OVH box, we were nowhere near any IO/CPU limits and it was, to my amazement, seriously fast. Now we’ll have heaps of room to grow with a stable donation goal for the next 12 months.
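
For a rough idea of the total database downtime, the step estimates from the DB list above can simply be added up. This is only a sketch of that arithmetic; the implied backup size at the end is derived from the 5-6 minute transfer estimate at 12MB/s and is not a figure stated anywhere in this post.

```python
# Step estimates in minutes, copied from the DB list above: (low, high).
steps = {
    "backup": (20, 25),
    "transfer": (5, 6),
    "restore": (10, 15),
}

low = sum(lo for lo, _ in steps.values())
high = sum(hi for _, hi in steps.values())
print(f"DB downtime: {low}-{high} minutes")

# ASSUMPTION: the transfer runs at a steady 12 MB/s for the whole 5-6 minutes.
RATE_MB_S = 12
size_low_gb = steps["transfer"][0] * 60 * RATE_MB_S / 1000
size_high_gb = steps["transfer"][1] * 60 * RATE_MB_S / 1000
print(f"Implied backup size: {size_low_gb:.1f}-{size_high_gb:.1f} GB")
```

That works out to somewhere between 35 and 46 minutes for the database, before the pictrs sync and the app deployment.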

See you on the other side.

Tiff

  • Tiff@reddthat.com (OP, mod) · 7 upvotes · 8 months ago (edited)

    That’s when the US timezones wake up. We physically cannot accept more than 3 requests per second. “Physically” meaning the actual physical limits of the network: 3 x 287ms = 861ms (we used to be at 930ms+; the server move got us 21ms closer!). LW generates more than 3 activities per second during the US “awake” hours. So we have a period of 8 hours where we need to catch up.
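
    The limit described here follows from the round trip time when activities are accepted one at a time. A minimal sketch of that arithmetic follows; the 287ms figure is from the comment above, while the incoming rate of 4 activities per second is purely an illustrative assumption, not a measured number.

```python
RTT_S = 0.287  # seconds per activity, from the 3 x 287ms figure above


def max_sequential_rate(rtt_s: float) -> float:
    """Upper bound on activities/second when each one waits for the previous."""
    return 1.0 / rtt_s


def backlog_growth(incoming_per_s: float, rtt_s: float, hours: float) -> float:
    """Activities queued up after `hours` of traffic above the sequential limit."""
    surplus = max(0.0, incoming_per_s - max_sequential_rate(rtt_s))
    return surplus * hours * 3600


rate = max_sequential_rate(RTT_S)
print(f"{rate:.2f} activities/s at most")  # about 3.48, so 3 whole requests fit in a second

# ASSUMPTION: 4 activities/s is a made-up incoming rate for illustration only.
print(round(backlog_growth(4.0, RTT_S, 8)), "activities behind after 8 hours")
```

    Anything arriving faster than that ceiling simply queues up until the quieter hours.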

    Like I’ve said in our forcing federation post, there isn’t anything to worry about: we are completely up-to-date on posts and comments thanks to our sync script.

    It’s just the sequential nature of Lemmy. I’m going to test a new container in the next 12 hours which removes the blocking metadata generation from the accepting of activities. That way we can guarantee at least 3 activities a second.

    Realistically, that is a minor fix and it won’t help with those graphs in the long term. We will need parallel sending for it to ever scale.

    On a side note, while we were on our old server and using our forcing federation script, we had it set to 10 parallel requests. The server didn’t even notice; I saw no increase in server load. That is good news for the lemmyverse in general, as everyone will be able to accept the new parallel sending without needing to increase their hardware.
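
    To illustrate why parallel sending matters when latency is the bottleneck, here is a toy sketch. It is not Lemmy's code and not our forcing federation script: `deliver` is a hypothetical stand-in that just sleeps to simulate the round trip, and the latency is shortened so the example runs quickly.

```python
import asyncio
import time


async def deliver(activity_id: int, latency_s: float) -> int:
    """Hypothetical stand-in for one delivery; the sleep simulates the round trip."""
    await asyncio.sleep(latency_s)
    return activity_id


async def send_sequential(n: int, latency_s: float) -> float:
    """Send n activities one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for i in range(n):
        await deliver(i, latency_s)
    return time.perf_counter() - start


async def send_parallel(n: int, latency_s: float, limit: int) -> float:
    """Send n activities with at most `limit` in flight; return elapsed seconds."""
    sem = asyncio.Semaphore(limit)

    async def bounded(i: int) -> int:
        async with sem:
            return await deliver(i, latency_s)

    start = time.perf_counter()
    await asyncio.gather(*(bounded(i) for i in range(n)))
    return time.perf_counter() - start


async def main() -> tuple[float, float]:
    seq = await send_sequential(20, 0.02)
    par = await send_parallel(20, 0.02, limit=10)  # 10 matches the script setting above
    return seq, par


seq, par = asyncio.run(main())
print(f"sequential: {seq:.2f}s, parallel (10 in flight): {par:.2f}s")
```

    The sketch ignores ordering between activities, which a real implementation would have to deal with.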

    Tiff

    • Blaze@reddthat.com · 4 upvotes · 8 months ago

      Thank you for the detailed answer!

      There isn’t anything to worry about because we are completely up-to-date on posts and comments because of our sync script.

      Sorry, it’s a bit late for me on this side, but if I understand correctly, posts and comments are indeed up-to-date, but upvotes are synchronized later, is this correct?

      Thank you for the work as always!

      • Tiff@reddthat.com (OP, mod) · 3 upvotes · 8 months ago (edited)

        but upvotes are synchronized later

        Correct. All votes are synchronised eventually.