How we reduced our data generation times by 50%
A walkthrough of how we reduced the time it takes to generate data in Neosync by 50% + benchmarks.
April 10th, 2024
When Nick and I started Nucleus Cloud Corp, our vision was to help developers build faster, more secure and resilient applications without having to be experts in security, data privacy, infrastructure, observability, the list goes on. Today, developers have to think about so many different things in order to develop and deploy a production-ready application that it's a miracle anything gets done. We built the Nucleus Cloud Platform to solve these problems and get developers back to what they love doing, writing code. We spent a year on the Nucleus Cloud Platform and were proud of the work that we had done. But at the end of the day, it was clear to us that cloud infrastructure is a tough game for startups. So we made the decision to work on other ideas.
When we started thinking of other ideas, the first place we went was to think about what our customers had asked for. One thing came up consistently. We had several of our customers ask us if we could help them test their services. And from an infrastructure standpoint, it was pretty straightforward. In Kubernetes, create multiple version of a deployment in different environments and test. But the complexity was in the data layer. How do we replicate data across different environments so that developers could consistently test their services locally but also in their CI pipeline. On top of that, a lot of the data was sensitive, so how do we protect user data as we do this? This got us thinking.
At the same time, something amazing was happening in the tech industry. AI/ML had exploded. ChatGPT became a household name and it seemed like there was a burning need for AI/ML tooling. Nick and I had both worked on ML platforms in the past and were familiar with the workflows and problems that arose when engineers are training and testing models. One of the biggest problems is having enough high-quality data to be able to train a model. Compute is cheap, data is not.
With this context, we started to think about the intersection of these two problems and came up with Neosync.
Today, developers and ML engineers don't have a way to generate high-quality synthetic data and sync it across environments. Neosync aims to solve this problem.
Neosync does two main things. First, it can generate high-quality synthetic data from scratch or anonymize existing data. This is extremely useful for developers who want to build and test their applications locally or in CI. Especially, if they're working with sensitive data. The status quo is that a developer will either manually PGDUMP
from their production database and PGRESTORE
locally or run a script that does the same thing. It's what almost every developer we've talked to does. That obviously needs to change. Not only is it ridiculously insecure but it's also wildly inefficient.
Almost no one we talked to had infrastructure around this workflow. How do you deal with versioning, access control, security, mismatches between local and CI? The list goes on. Neosync aims to solve these problems.
For ML engineers, they need access to high quality synthetic data to train and fine tune models. Neosync has the notion of a transformer. These are flexible modules that you can use to generate pretty much any kind of data. We're starting with the base data types i.e float64, int64, strings
etc and adding more defined data types such as first_name, last_name, address, ssn
and many more. We're shipping with 40+ transformers and are constantly adding more. If you're ever used the faker library then we're basically on par with that out of the gate.
In the near future, we're going to be releasing models that allow you to define the data you want and let the Neosync automatically generate it. Imagine being able to generate a million rows of synthetic data that you can use to fine-tune your model with just a prompt. You'll have full control over the distribution and shape of that data. You can then take that and fine tune your model seamlessly. That's the future we're building towards.
The second thing Neosync does is that it syncs data across environments. If you're anonymizing data from prod and want to use that locally, you need to first subset the data, because you likely only need a portion of it and then secondly sync it across stage, CI and dev. Neosync natively handles all of this with a workflow orchestration framework powered by Temporal. You can even sync it directly to a local DB using the Neosync CLI. For ML engineers, this can mean syncing data across environments from an S3 bucket or data lake. We made it dead simple and put it on a schedule.
It's clear that AI/ML is a major platform change and the best businesses are built ontop of platform changes. There is significant demand for high quality data for building applications and training models and we want Neosync to be at the forefront of that category. At the same time, data privacy and security have become more important than ever. The number and severity of data breaches continues to increase as hackers become more sophisticated with deep fake technology. This is why securing and protecting sensitive data is so important. All of this together makes now the perfect time for Neosync to enter the world.
As more developers and companies start to train and fine tune their own models, they're going to need a way to easily generate synthetic data. Especially, as data privacy laws become more stringent around the word, this need only increases.
There are two main reasons we think open source is the right model for Neosync. The first is that for most companies, they don't want their data to leave their infrastructure. Especially if that data is sensitive. When it's inside of your infrastructure, it's a lot easier to control access to the data and audit who interacted with it. Once that data leaves your infrastructure, it's anybody's game. Open source is a great way to get adoption from mid-size and enterprise sized companies who have long procurement cycles and stringent data privacy and security programs. A developer can fork your repo and run it locally in an hour if your project is open source versus if it's not, well, then you're in procurement hell for the next 6-9 months.
The second reason is that we fundamentally believe that when it comes to data, open source wins. Whether it's databases, ETL pipelines, big data applications or AI/ML models, open source is the best path forward. In fact, we didn't even really consider doing only a SaaS platform, it was pretty evident that we needed to be open source first. Does that mean that we will never launch a hosted version? Probably, not. There are still a lot of companies that don't or can't host something themselves. It might because they don't have the resources, skillset or time, but at the end of the day, they need a tool they can easily access and use. But for now, we're completely focused on open source.
And our promise is that we will always have an open source version of Neosync that you can use. That will never change.
Our vision is that high-quality data is accessible to every developer and ML engineer. Whether you're building an application or training a model, you should have access to the data you need in the format and shape you need it. Data should no longer be the bottleneck to great applications and models.
Build and evangelize. It's really that simple. We're going to be building a lot and talking about it a lot. There is a huge roadmap that we're really excited about and we can't wait to see teams using Neosync. Whether you're a developer building an application and you need high-quality synthetic data to test your application or an ML engineer fine-tuning the latest OSS model, we're building Neosync into something that can serve you. If you're interested in what we're working on and want to stay up to date or (even better) contribute, check out our github repo.
Until next time,
Evis & Nick
A walkthrough of how we reduced the time it takes to generate data in Neosync by 50% + benchmarks.
April 10th, 2024
A comparison between pg_dump and Neosync for Postgres data migrations
April 8th, 2024