Lessons from Let's Do This#
I joined Let's Do This as a Software Engineer in September 2022. Back then I was relatively early in my career, with just two years of experience at a much smaller startup before this role. Personally and professionally, this has been a period of intense growth, and I am very grateful for it.
It has been three and a half years since then, and soon my time here will come to an end. I therefore wanted to take some time to reflect and record my learnings, lest I forget important lessons over time.
Please sit down with a nice cup of chai, as this is going to be a long one.
Some background#
Let's Do This builds tools to help organisers run some of the largest mass participation sports events in the world, such as the London Marathon, the Peachtree 10K, and the Great North Run.
Our system consists of the following parts:
- A Next.js based web frontend for event organisers and participants.
- A React Native mobile app for raceday support and participant check-in.
- Internal microservices that communicate over gRPC, fronted by a GraphQL server to present a unified API to our clients.
- MongoDB for operational data.
- Snowflake for reporting and BI.
The stack also consists of several other smaller pieces, such as SQS + lambdas for asynchronous workloads, and EventBridge for fan-out signals. All of this is hosted on AWS, except for our databases which are managed by their respective providers (MongoDB Atlas, and Snowflake).
All of our services are written in TypeScript, and use Node.js as the runtime.
My previous experience was mostly around Firebase and serverless functions, so this stack was significantly more complicated than what I was used to. Learning how to work with it was an interesting challenge, and just a couple of weeks in I found myself to be quite productive.
Years later, this stack has held up quite well for the most part. I disagree with some of the choices we made along the way, which I'll outline in the sections to follow.
Lessons#
Here is a non-exhaustive, unordered list of lessons I learned while working here. Most of these are engineering focused, but I have some personal ones in there too.
I hope this list is useful to you. If it is, feel free to let me know about it.
Invest in observability#
Before LDT, I had only ever used logs for observability in production. Simple logs can go a long way, and you can use some tricks to make them even more useful. For a complex system though, you also want traces, monitors, alarms, performance profiles, and reporting dashboards.
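One such trick is emitting structured JSON logs stamped with a correlation ID, so you can pull up every line belonging to a single request. A minimal, dependency-free sketch of the idea (the field names and helper are my own convention, not LDT's setup):

```typescript
import { randomUUID } from "node:crypto";

type Fields = Record<string, unknown>;

// A logger bound to one request: every line it emits carries the same requestId,
// so a log search for that ID reconstructs the request's whole story.
function createRequestLogger(requestId: string = randomUUID()) {
  const log = (level: "info" | "error", message: string, fields: Fields = {}) =>
    console.log(
      JSON.stringify({
        ts: new Date().toISOString(),
        level,
        requestId,
        message,
        ...fields,
      }),
    );
  return {
    info: (message: string, fields?: Fields) => log("info", message, fields),
    error: (message: string, fields?: Fields) => log("error", message, fields),
  };
}

const logger = createRequestLogger();
logger.info("checkout started", { basketSize: 3 });
logger.error("payment declined", { provider: "stripe" });
```

Because every line is a single JSON object, log platforms can index the fields and let you filter, group, and graph on them, which is where plain text logs fall over.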
A good observability setup lets you see the entire lifecycle of a request. LDT uses Datadog internally, and our setup gives us the ability to see each request as it originates from the frontend, goes through GraphQL, and filters down to a number of internal services. We can also see database queries, related logs, metrics and error traces generated as part of it.
This is invaluable for several reasons:
- Startups have to move quickly, often leaving behind a trail of tech debt. This inevitably causes fires, and when it does your observability setup MUST be rock solid to investigate and resolve them quickly.
- As you scale, parts of your product will slow down. Some pages will take ages to load, some operations will overload the CPU on your containers, some requests will overwhelm your database. Without proper latency figures, CPU metrics or profiles, it will be very challenging to investigate and fix these problems. You do not want to be guessing what exactly killed your database during a SEV-1 caused by a massive traffic spike.
- The most embarrassing way to learn about an incident is when your customers tell you there's an incident. Active monitors on latency, error rates, and all other use metrics are the best way to ensure you know about incidents before your customers do.
The observability setup at LDT is great, and honestly in a different league compared to the places I had worked at in the past. The company actively invested in it through a Platform team, and in my opinion this has paid for itself several times over.
Sure, we had fires, but we also had the tools to put them out quickly.
MongoDB is great, until it's not#
LDT uses MongoDB as its operational database. It is a document oriented, schemaless, NoSQL database.
Not being constrained by schemas is a blessing in a startup because the ideal data structures often reveal themselves after you have shipped a prototype. Mongo lets you throw data at it, and it scales really well. It's also fairly simple to use and packs a tonne of features.
Unfortunately, there are also some rough edges I encountered while working with it.
Evolving schemas#
Unless you're very strict with how you read and write your data, the schemaless nature of MongoDB can be poisonous to your stack. Your data structures will evolve over time, and each change will require you to make one of these choices:
- Backfill your collections to convert documents using old schemas to new ones (aka use migrations). This lets your code be simple, as you only have to deal with one schema at a time.
- Version your schemas, and prepare to deal with all versions of them in any code that touches them.
You can enforce strict schemas with Mongo, but doing so comes with higher CPU loads.
I would recommend using a strict data access layer that can help you manage schemas.
Do NOT use mongoose: it has poor defaults, poor performance, and encourages use of features that will cause you headaches later. Use the official Node.js MongoDB driver instead. It is excellent. Combine it with Zod for schema validation if you need.
If you must use mongoose, add .lean() to your queries to avoid CPU spikes when serialising/deserialising objects from the database.
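A data access layer along these lines can be very small. Here's a sketch of the shape I have in mind, with validation hand-rolled instead of Zod to keep it dependency-free; the `Participant` type, its fields, and the collection name are all hypothetical:

```typescript
// A thin data access layer: every document passes through one parse function
// on its way in or out, so only one schema shape ever reaches application code.
type Participant = { _id: string; name: string; schemaVersion: number };

function parseParticipant(doc: unknown): Participant {
  const d = doc as Record<string, unknown>;
  if (typeof d?._id !== "string" || typeof d?.name !== "string") {
    throw new Error("invalid participant document");
  }
  // Upgrade old documents on read: pre-versioning documents are treated as v1,
  // so callers never have to branch on schema version themselves.
  const schemaVersion = typeof d.schemaVersion === "number" ? d.schemaVersion : 1;
  return { _id: d._id, name: d.name, schemaVersion };
}

// With the official driver, reads would funnel through the parser, e.g.:
//   const raw = await db.collection("participants").findOne({ _id });
//   return raw === null ? null : parseParticipant(raw);
```

With Zod, `parseParticipant` collapses to a `schema.parse(doc)` call, but the structure is the same: one choke point per collection.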
Too many features#
MongoDB boasts a tonne of features, including vector search, full text search, graph relationships, timeseries collections, change streams, and so much more.
It is a very good database if you just need to store documents and run queries on them. We have had great success with it for such workloads. However, if you step outside this well supported zone to use some of the extra features it offers, be warned -- here be dragons.
- We used Mongo's graph capabilities to build a social networking product, but we found that it does not perform well even under moderate load.
- Unless you use the hosted Atlas Search feature, Mongo's text search indices are not performant and hard to tweak.
- Aggregations are great in theory, but can overwhelm the database quickly. Avoid them if you can, or use them with caution if you can't. Note that some Mongo features require the use of aggregations, including full text search.
Are some of these problems caused by skill issues? Possibly. Would I still recommend using them? No.
Footguns#
Some footguns I've discovered when using Mongo:
- DO NOT use unbounded arrays on your documents. A single large document in a collection can end up slowing down the entire collection. I had to learn this the hard way, when a misbehaving lambda accidentally inserted 70,000 entries into the status log array of a document. This caused all queries over the collection to slow down to a crawl, leading to a SEV-2.
- DO NOT count the number of documents in a large collection. Use `estimatedDocumentCount()` instead. If you must use `countDocuments()`, use it with an index hint for the `_id` property.
- AVOID using `ensureIndexes()` to create indexes during deployments or CI runs. If an index fails to create, it will block your deployments. Unfortunately, I have not found a good way to keep indexes in sync between what exists in the codebase, and what exists in the database.
- AVOID using ordered events when using Atlas Triggers, as they have a low throughput. Unordered triggers are great.
GraphQL is complicated#
GraphQL is a fantastic technology that affords you an unparalleled level of flexibility. It lets clients fetch all the data they need to render any page in one request, saving tonnes of time in round trips to the backend. It lets you define your API in a schema first format, ensuring the backend and the frontend always stay in sync. It makes it easy to query, poll and subscribe for data. It even offers powerful caching capabilities.
You also don't need it.
I have found GraphQL to be too powerful, too flexible, and too difficult to manage.
- Authz is hard: clients can query data through a graph of interconnected types, so each type can be reached through multiple paths. You need to ensure each path is protected by the correct authz checks, or you risk data leaks. This is surprisingly tricky to get right in practice.
- Performance optimisation is hard: since each type can be reached through multiple paths, it is difficult to predict the hot paths you need to optimise.
- Graphs can get too big: we have a big product surface area, and our GraphQL API reflects that. It's big enough that common visualisers crash (or slow down to a crawl) when trying to render the structure of our graph.
- Browsers don't natively speak GraphQL: if you look at the network inspector tab of your browser talking to a GraphQL API, it's hard to figure out what's going on without drilling into the details of each request. Every request is a POST to the same endpoint, and every response has the same status code, regardless of its semantics.
- Codegen is tricky: to consume GraphQL from TypeScript (or any typed programming language), you must use codegen. Unless you get your settings exactly right, the generated code can bloat your bundle, slowing down page load speeds.
- Hard to hire for: It's harder to hire people familiar with GraphQL than it is to hire people familiar with REST. Thinking in GraphQL requires practice, and most engineers (myself included) end up creating non-idiomatic APIs during the process.
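One mitigation for the authz problem worth mentioning: attach the check to the type's resolver itself, rather than to each entry point, so it runs no matter which path through the graph reached the type. A dependency-free sketch of that idea (the `Context` shape, the `organiser` role, and the payout resolver are all hypothetical):

```typescript
type Context = { userId: string; roles: string[] };
type Resolver<Args, Result> = (args: Args, ctx: Context) => Result;

// Wrap a resolver in a reusable guard. Because the guard lives on the resolver,
// every query path that resolves this field runs the same authz check.
function withAuthz<Args, Result>(
  check: (ctx: Context) => boolean,
  resolver: Resolver<Args, Result>,
): Resolver<Args, Result> {
  return (args, ctx) => {
    if (!check(ctx)) throw new Error("forbidden");
    return resolver(args, ctx);
  };
}

const isOrganiser = (ctx: Context) => ctx.roles.includes("organiser");

const payoutResolver = withAuthz(
  isOrganiser,
  ({ eventId }: { eventId: string }) => `payouts for ${eventId}`,
);
```

This doesn't make the problem easy, but it does turn "audit every path" into "audit every resolver", which is at least an enumerable list.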
Design your network topology#
The gateway to the LDT backend is our GraphQL service. To serve a request, it calls out to a number of other internal services over gRPC. Internally, every service is free to contact every other service. Due to this flexibility, connections between them have emerged organically, often leading to cyclic dependencies.
In hindsight, we should have been stricter about this design. Cyclic dependencies make it impossible to deploy services in a topologically sorted order. They also make it difficult to identify which services are upstream/downstream of a given resource.
Be opinionated about your network topology, and model it as a DAG if possible.
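Enforcing the DAG property can even be automated: keep the service dependency map in code, and fail CI the moment a cycle appears. A minimal sketch of the check (the service names are illustrative, not LDT's):

```typescript
// Find a cycle in a service dependency graph using depth-first search.
// Returns the cyclic path if one exists, or null for a valid DAG.
function findCycle(graph: Record<string, string[]>): string[] | null {
  const visiting = new Set<string>(); // nodes on the current DFS path
  const done = new Set<string>();     // nodes fully explored, known cycle-free

  function visit(node: string, path: string[]): string[] | null {
    if (visiting.has(node)) return [...path, node]; // back-edge: cycle found
    if (done.has(node)) return null;
    visiting.add(node);
    for (const dep of graph[node] ?? []) {
      const cycle = visit(dep, [...path, node]);
      if (cycle) return cycle;
    }
    visiting.delete(node);
    done.add(node);
    return null;
  }

  for (const node of Object.keys(graph)) {
    const cycle = visit(node, []);
    if (cycle) return cycle;
  }
  return null;
}

// graphql -> bookings -> payments is fine; adding payments -> bookings
// would close a cycle and make findCycle return the offending path.
findCycle({ graphql: ["bookings"], bookings: ["payments"], payments: [] }); // null
```

Wired into CI, a non-null result becomes a failing build, so the topology stays a DAG by construction rather than by convention.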
Monorepos are great#
When I joined, our backend code was divided in several git repositories, one for each service. We shared code between them by publishing private NPM packages.
This setup was fine, but quite inefficient:
- Creating a new RPC required three pull requests: one to define the RPC in the shared protobufs package, a second to implement it in a service, and a third to consume it. This meant three sets of review cycles, and three sets of deployments. It also required you to define the perfect RPC signature at the start of the process, because changing it was expensive time-wise.
- Each repository used slightly different tooling: different linting rules, different formatting styles, different code conventions, etc. Contributing effectively across multiple services had significant cognitive overhead.
- Deployments between different services required manual coordination. You'd better make sure you merge code that adds a new feature BEFORE you merge code that consumes it.
We slowly transitioned to a monorepo setup to bring all backend services under one repository. This allowed us to standardise tooling across the entire codebase, removed the need to publish private packages, and greatly reduced the friction for shipping new features.
It was a significant productivity boost, and widely regarded as one of the best developer experience improvement projects ever pursued at the company. I cannot recommend it enough.
SQS is not a job queue#
The preferred queuing technology at LDT is SQS. It is cheap, fast, easy to use, and scales well. When combined with lambdas and batch delivery, you can handle large volumes of messages effortlessly. These qualities make SQS an exceptionally good message queue, and it should be your default choice for light async workloads that need reliable execution.
Where SQS falls short is in acting as a job queue. It lacks capabilities that a job queue must support:
- You cannot specify a message priority. For example, if you use SQS to manage sending emails, you cannot assign different priorities to transactional email messages and marketing email messages.
- SQS does not have a concept of job dependencies - all messages are expected to be independent and idempotent. This makes it difficult to model parent-child relationships between jobs.
- You cannot specify an arbitrary delay on message delivery. You can work around this by scheduling cron events on EventBridge, but the best SQS can do natively is to let you delay a message by up to 15 minutes.
- To track job statuses, results, errors and progress updates in SQS, you must pair each message with a corresponding document in a database. Keeping them in sync can be tricky. It also affects throughput, and of course, creates extra database load.
- SQS has no native concurrency controls. The best you can do is control the number of active consumers, for example by setting a `reservedConcurrency` on lambdas.
If you need a job queue, use a tool like BullMQ.
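To make the gap concrete, here's a toy in-memory sketch of two features the list above calls out as missing from SQS: per-job priority and arbitrary delays. BullMQ provides both (backed by Redis); this is purely illustrative and not how you'd build the real thing.

```typescript
type Job = { name: string; priority: number; runAt: number };

class ToyJobQueue {
  private jobs: Job[] = [];

  // What SQS can't do natively: a priority per message, and an
  // arbitrary delay rather than the 15-minute maximum.
  add(name: string, opts: { priority?: number; delayMs?: number } = {}) {
    this.jobs.push({
      name,
      priority: opts.priority ?? 0,
      runAt: Date.now() + (opts.delayMs ?? 0),
    });
  }

  // Pop the highest-priority job whose delay has elapsed, if any.
  next(now: number = Date.now()): Job | undefined {
    const ready = this.jobs.filter((j) => j.runAt <= now);
    ready.sort((a, b) => b.priority - a.priority);
    const job = ready[0];
    if (job) this.jobs.splice(this.jobs.indexOf(job), 1);
    return job;
  }
}

// The email example from above: transactional mail outranks marketing mail.
const queue = new ToyJobQueue();
queue.add("marketing-email", { priority: 1 });
queue.add("transactional-email", { priority: 10 });
queue.next()?.name; // "transactional-email"
```

Everything else on the list (dependencies, status tracking, concurrency limits) layers on top of the same idea: a job queue owns durable per-job state, whereas SQS deliberately treats messages as opaque and independent.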
Tracking money is HARD#
Let's Do This sells tickets on behalf of event organisers, taking a small cut from each sale. Ensuring correctness in this seemingly simple financial model is surprisingly hard. For example,
- Can you support multiple currencies in the same basket?
- How do you deal with disputed charges?
- Can you support partial refunds in a buy-now-pay-later transaction?
- Do you handle forex correctly if your customers choose to pay in a different currency?
- Do you need to recalculate tax if participants change their addresses post payment?
There are no simple answers to these questions, and the way you solve each case likely causes niche problems for all others. Even tracking something as basic as money coming in and money going out is not a trivial problem!
Taxes introduce a whole new layer of complexity to this:
- Have you hit nexus on your taxable transactions in Atlanta?
- What is the tax code for a BOLDERBoulder t-shirt shipped to a participant in Texas?
- Have you already filed taxes for the transaction being refunded? Can you void the transaction, or do you need a reversal?
- Does your contract require you to file taxes on behalf of this organiser, or do they handle it themselves?
Getting taxes wrong is also quite risky: you either invite expensive fines from the authorities, or your CFO goes to jail.
Unfortunately, I do not have any good solutions to offer here. My general advice:
- If you're a startup that does not have massive financial expertise, lean on Stripe as much as you can.
- Use products like Avalara to help you with tax.
- Track money coming in and going out in a ledger.
- Do not create your own money package. Use something industry standard like dinero or currency.js.
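The "do not create your own money package" advice exists because naive float arithmetic silently drifts. A quick demonstration of the problem, plus the integer-minor-units representation those libraries use under the hood (the `Money` shape here is a simplified sketch, not any library's actual API):

```typescript
// The classic float trap: these values have no exact binary representation.
0.1 + 0.2 === 0.3; // false

// Money libraries avoid this by storing integer minor units (pence, cents)
// alongside an explicit currency, and refusing to mix currencies.
type Money = { amount: number; currency: string }; // amount in minor units

function addMoney(a: Money, b: Money): Money {
  if (a.currency !== b.currency) throw new Error("currency mismatch");
  return { amount: a.amount + b.amount, currency: a.currency };
}

const ticket: Money = { amount: 2999, currency: "GBP" }; // £29.99
const fee: Money = { amount: 300, currency: "GBP" };     // £3.00
addMoney(ticket, fee); // { amount: 3299, currency: "GBP" }
```

Integer addition is exact, and the mandatory currency tag turns "added dollars to pounds" from a silent data corruption into an immediate error.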
Ignoring DX is a recipe for disaster#
If all of your engineers spend 100% of their time working on product, and 0% of the time working on housekeeping, you will eventually spend 110% of your time on just keeping the lights on.
CI will be flaky. Deployments will be blocked. You will be several major versions behind on your frontend framework. Your database driver version will reach EOL. You won't be able to adopt any new Node.js improvements. Fires will become more frequent. Incident reviews will be ignored. New joiners won't have any accurate docs to read. Life will lose colour, and everyone will be very, very sad.
One way to avoid these issues is to make it someone's job to care about DX. A small team (or even just one engineer) dedicated to solving such problems can deliver substantial productivity gains for everyone else in the company.
Friendships last longer than jobs#
I spend a significant part of my day working with my colleagues. In aggregate, I have probably spent just as much time with them as I have with my partner. It is only natural then, that some colleagues become friends, and some friendships remain even after you leave your job.
Build friendships, and invest in people.
Actively steer your performance reviews#
Actively participate in your performance reviews. If you don't, you may not advance in your career as quickly as you deserve.
Set clear goals with your line manager. Ask your peers for feedback. Record your progress. Be your own cheerleader. Shout loudly in Slack about that amazing feature you just shipped. Help everyone see why you obviously deserve that raise, or that promotion.
Remember that being visible is just as important as being good.
Help everyone level up#
Engineering is a competitive career path, and imposter syndrome is very real. AI only exacerbates the problem: some can wield its power to deliver massive projects in half the time, whereas others find it barely useful.
It is easy to feel that you are not good enough.
You are not delivering fast enough.
You are not making as much money as your peers.
You will be left behind.
Leave it unaddressed and you will sleepwalk into a culture of toxic competition, anxiety, and burnout.
Problems are rarely solved by suffering alone in silence, if ever. The best way to solve this one is by inculcating a culture of sharing knowledge, and helping everyone level up.
LDT handled this really well. We ran three fortnightly sessions: Forum, Bookclub, and Retro.
- Forum was focused on announcements, discussions and collecting feedback on engineering decisions. Topics varied from PSAs on "Using Temporals for date and time" to guides on "Streams in MongoDB".
- Bookclub was focused on sharing knowledge through talks and presentations. "Datadog 101", "how do i dabatase?", "Advanced Datadog", "Jesting around with Vitest", "Datadog 3: Dawn of the dog".
- Retro was focused on, well, retros.
The agenda for all three sessions was always open, and everyone was encouraged to participate. It really helped everyone feel like they were part of one team where everyone was looking out for each other.
Conclusion#
Working at Let's Do This has been a great experience. I feel like a more well-rounded and capable engineer for having spent time here. Of course there are challenges that this place needs to overcome, but what company doesn't? It is filled with talented people, and I will miss working with them.
I will also miss working in this industry, as it is filled with folks who want everyone to be out there and have amazing experiences. Cheering for runners at the finish line of a marathon is exhilarating. It's even better if you volunteer to help run these events, and nothing beats participating in one. Crossing the finish line of the RideLondon 100 miler on Tower Bridge remains a personal highlight for me.
What's next?#
After a short break, I'll be heading to Encord. You can find me on Bluesky to follow along for more updates.