
AI Shouldn’t Waste Time Reinventing ETL

The AI community is reinventing data integration, but current ETL platforms already solve this problem. Here’s why they shouldn’t reinvent it.

By John Lafleur · Aug. 28, 2023 · Opinion

The recent progress in AI is very exciting. People are using it in all sorts of novel ways, from improving customer support experiences and writing and running code to making new music and even accelerating medical imaging technology. 

But in the process, a worrying trend has emerged: the AI community seems to be reinventing data movement (also known as EL or ELT). Whether they call them connectors, extractors, integrations, document loaders, or something else, people are writing the same code to extract data out of the same APIs, document formats, and databases, and then load it into vector DBs or indices for their LLMs.

The problem is that building and maintaining robust extraction and loading pipelines from scratch is a huge commitment. And there’s so much prior art in that area that for almost all engineers or companies in the AI space, it’s a huge waste of time to rebuild it. In a space where breaking news emerges approximately every hour, the main focus should be on making your core product incredible for your users, not going on side quests. And for almost everyone, the core product is not data movement; it’s the AI-powered magic sauce you’re brewing. 

A lot has been written about the challenges involved in building robust Extract, Transform, and Load (ETL) pipelines, but let’s contextualize it within AI.

Why Does AI Need Data Movement?

LLMs trained on public data are great, but you know what’s even better? AIs that can answer questions specific to us, our companies, and our users. We’d all love it if ChatGPT could learn our entire company wiki, read all of our emails, Slack messages, meeting notes, and transcripts, plug into our company’s analytics environment, and use all of these sources when answering our questions. Or, when integrating AI into our own product (for example, with Notion AI), we'd want our app’s AI model to know all the information we have about a user when helping them.

Data movement is a prerequisite for all that. 

Whether you’re fine-tuning a model or using Retrieval-Augmented Generation (RAG), you need to extract data from wherever it lives, transform it into a format digestible by your model, and then load it into a datastore your AI app can access to serve your use case. 

[Diagram: data movement in a RAG pipeline]

The diagram above illustrates what this looks like when using RAG, but even if you’re not using RAG, the basic steps are unlikely to change: you need to extract, transform, and load (ETL) the data in order to build AI models that know non-public information specific to you and your use case.
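As a minimal illustration, here is what those three steps can look like in code. This is only a sketch under assumed stand-ins: the document source, the fake_embed function, and the in-memory store are hypothetical placeholders for a real API client, embedding model, and vector DB.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    embedding: list[float]

def fake_embed(text: str) -> list[float]:
    # Placeholder: a real pipeline would call an embedding model here.
    return [float(ord(c)) for c in text[:8]]

def extract() -> list[dict]:
    # Extract: pull raw documents from wherever they live (APIs, DBs, files).
    return [{"id": "wiki-1", "body": "Our deploy process works like this..."}]

def transform(docs: list[dict], chunk_size: int = 200) -> list[Chunk]:
    # Transform: split each document into model-digestible chunks and embed them.
    return [
        Chunk(d["id"], d["body"][i:i + chunk_size],
              fake_embed(d["body"][i:i + chunk_size]))
        for d in docs
        for i in range(0, len(d["body"]), chunk_size)
    ]

def load(chunks: list[Chunk], store: list[Chunk]) -> None:
    # Load: write chunks into a store the AI app queries at serving time
    # (a plain list standing in for a vector DB or index).
    store.extend(chunks)

store: list[Chunk] = []
load(transform(extract()), store)
```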

Why Is Data Movement Hard?

Building a basic, functional MVP for data extraction from an API or database is usually (though not always) doable quickly, often in under a week. The really hard part is making it production-ready and keeping it that way. Let’s look at some of the standard challenges that come up when building extraction pipelines.

Incremental Extracts and State Management

If you have any meaningful data volume, you’ll need to implement incremental extraction such that your pipeline only extracts the data it hasn’t seen before. To do this, you’ll need to have a persistence layer to keep track of what data each connection extracted. 
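A minimal sketch of what that can look like, assuming a source that can be queried by an updated-since cursor; the JSON file here stands in for a real state database:

```python
import json
import pathlib

STATE_FILE = pathlib.Path("sync_state.json")  # stand-in for a state database

def load_cursor(connection_id: str) -> str:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    return state.get(connection_id, "1970-01-01T00:00:00Z")

def save_cursor(connection_id: str, cursor: str) -> None:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[connection_id] = cursor
    STATE_FILE.write_text(json.dumps(state))

def incremental_sync(connection_id: str, fetch_since) -> list[dict]:
    # Only pull records newer than the cursor persisted by the last run.
    cursor = load_cursor(connection_id)
    records = fetch_since(cursor)
    if records:
        # Advance the cursor to the newest record we saw this run.
        save_cursor(connection_id, max(r["updated_at"] for r in records))
    return records
```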

Transient Error Handling, Backoffs, Resume-On-Failure(s), Air Gapping

Upstream data sources fail all the time, sometimes without any clear reason. Your pipelines need to be resilient to this and retry with the right backoff policies. If the failures are not so transient (but still not your fault), then your pipeline needs to be resilient enough to remember where it left off and resume from the same place once upstream is fixed. Sometimes, the problem coming from upstream is severe enough (like an API dropping crucial fields from records) that you want to pause the whole pipeline altogether until you examine what’s happening and manually decide what to do.
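A sketch of that retry-and-pause logic; TransientError and the fetch_page callable are hypothetical hooks into whatever client you actually use:

```python
import random
import time

class TransientError(Exception):
    """An upstream failure worth retrying (timeouts, 5xx responses, etc.)."""

class PipelinePaused(Exception):
    """Raised to halt the pipeline for manual inspection."""

def fetch_with_backoff(fetch_page, cursor: str, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            # Resumes from the persisted cursor, so a later run picks up here.
            return fetch_page(cursor)
        except TransientError:
            # Exponential backoff with jitter before the next attempt.
            time.sleep((2 ** attempt) + random.random())
    # Retries exhausted: stop, keep the cursor, resume once upstream is fixed.
    raise PipelinePaused(f"giving up at cursor {cursor!r}")
```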

Identifying and Proactively Fixing Configuration Errors

If you’re building pipelines that extract your customers’ data, you’ll need to implement defensive checks to ensure that all the configuration your customers gave you to extract data on their behalf is correct, and if it’s not, to quickly give them actionable error messages. Most APIs do not make this easy because they don’t publish comprehensive error tables, and even when they do, they rarely give you endpoints you can use to check the permissions assigned to, e.g., API tokens, so you have to find ways to balance comprehensive checks with quick feedback for the user.
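For illustration, a sketch of such a check step; the field names, the client interface, and the error wording are all hypothetical:

```python
def check_config(config: dict) -> list[str]:
    # Validate static config before running anything expensive.
    errors = []
    if not config.get("api_token"):
        errors.append("api_token is missing; create one in your account's API settings.")
    if not config.get("start_date"):
        errors.append("start_date is required, e.g. 2023-01-01.")
    return errors

def probe_permissions(client, resources=("users", "invoices")) -> list[str]:
    # Few APIs expose a "what can this token do?" endpoint, so probe a cheap
    # read on each required resource and turn failures into actionable advice.
    errors = []
    for resource in resources:
        try:
            client.list(resource, limit=1)
        except PermissionError:
            errors.append(f"Your token lacks read access to {resource!r}; grant that scope and retry.")
    return errors
```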

Authentication and Secret Management

APIs range in simplicity from simple bearer token auth to “creative” implementations of session tokens or single-use-token OAuth. You’ll need to implement the logic to perform the auth as well as manage the secrets, which may be getting refreshed once an hour, potentially coordinating secret refreshes across multiple concurrent workers.
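A sketch of coordinating token refreshes across concurrent workers within one process, using a lock so only one worker refreshes at a time; refresh_fn is a hypothetical call to the provider’s token endpoint, and cross-machine coordination would need a shared store instead:

```python
import threading
import time

class TokenManager:
    def __init__(self, refresh_fn, ttl_seconds: int = 3600):
        self._refresh_fn = refresh_fn  # hits the provider's token endpoint
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._token = None
        self._expires_at = 0.0

    def get_token(self) -> str:
        with self._lock:
            # Refresh slightly early so in-flight requests don't race expiry,
            # and so concurrent workers reuse one refresh instead of stampeding.
            if time.monotonic() >= self._expires_at - 60:
                self._token = self._refresh_fn()
                self._expires_at = time.monotonic() + self._ttl
            return self._token
```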

Optimizing Extract and Load Speeds, Concurrency, and Rate Limits

And speaking of concurrent workers, you’ll likely want to implement concurrency to achieve high throughput for your extractions. While this may not matter on small datasets, it’s absolutely crucial on larger ones. Even though APIs publish official rate limits, you’ll need to empirically figure out the best parallelism parameters for maxing out the rate limit provided to you by the API without getting IP blacklisted or forever-rate-limited. 
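A sketch of bounding that concurrency with a semaphore, where MAX_IN_FLIGHT is exactly the parameter you would tune empirically (fetch_page is a hypothetical async client call):

```python
import asyncio

MAX_IN_FLIGHT = 8  # start conservative; raise until you near the real limit

async def fetch_all(pages: list[str], fetch_page):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def fetch_one(page: str):
        async with sem:  # never exceed MAX_IN_FLIGHT concurrent requests
            return await fetch_page(page)

    return await asyncio.gather(*(fetch_one(p) for p in pages))
```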

Adapting to Upstream API Changes

APIs change and take on new undocumented behaviors or quirks all the time. Many vendors publish new API versions quarterly. You’ll need to keep an eye on how all these updates may impact your work and devote engineering time to keep it all up to date. New endpoints come up all the time, and some change their behavior (and you don’t always get a heads-up). 
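One cheap defense is to compare what the API actually returns against the schema your pipeline expects, so drift surfaces as a log line rather than a silent breakage; the expected fields here are hypothetical:

```python
import logging

EXPECTED_FIELDS = {"id", "email", "updated_at"}  # hypothetical schema

def check_record_shape(record: dict) -> None:
    missing = EXPECTED_FIELDS - record.keys()     # fields the API stopped sending
    unexpected = record.keys() - EXPECTED_FIELDS  # new, undocumented fields
    if missing:
        logging.warning("Upstream dropped fields: %s", sorted(missing))
    if unexpected:
        logging.info("Upstream added fields: %s", sorted(unexpected))
```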

Scheduling, Monitoring, Logging, and Observability

Beyond the code that extracts data from specific APIs, you’ll also likely need to build some horizontal capabilities leveraged by all of your data extractors. You’ll want some scheduling as well as logging and monitoring for when the scheduling doesn’t work or when other things go wrong, and you need to go investigate. You also likely want some observability, such as how many records were extracted yesterday, today, last week, etc., and which API endpoints or database tables they come from. 
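As a sketch, even simple per-stream counters answer the "how many records, from where, and when" questions; a real system would ship these to a metrics backend rather than just logging them:

```python
import datetime
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO)

def record_sync_stats(streams: dict) -> Counter:
    # Count extracted records per source stream (API endpoint or table).
    counts: Counter = Counter()
    for stream_name, records in streams.items():
        counts[stream_name] = len(records)
        logging.info("%s: extracted %d records from %s",
                     datetime.date.today(), len(records), stream_name)
    return counts
```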

Data Blocking or Hashing

Depending on where you’re pulling data from, you may need some privacy features for either blocking or hashing columns before sending them downstream. 
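For example, a sketch that drops blocked columns and replaces sensitive ones with a salted hash before records go downstream; the column names and salt handling are hypothetical:

```python
import hashlib

BLOCK_COLUMNS = {"ssn"}    # never send these downstream
HASH_COLUMNS = {"email"}   # send only a salted digest
SALT = b"rotate-me"        # hypothetical; manage like any other secret

def scrub(record: dict) -> dict:
    out = {}
    for key, value in record.items():
        if key in BLOCK_COLUMNS:
            continue
        if key in HASH_COLUMNS:
            value = hashlib.sha256(SALT + str(value).encode()).hexdigest()
        out[key] = value
    return out
```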

To be clear, the above does not apply if you just want to move a few files as a one-time thing. 

But it does apply when you’re building products that require data movement. Sooner or later, you’ll need to deal with most of these concerns. And while no single one of them is impossible rocket science, taken together, they can quickly add up to one or multiple full-time jobs, more so the more data sources you’re pulling from. 

And that’s exactly the difficulty with maintaining data extraction and loading pipelines: the majority of their cost comes from the continuous incremental investment needed to keep them functional and robust. For most AI engineers, that’s just not the job that adds the most value to their users. Their time is best spent elsewhere.

So, What Does an AI Engineer Have To Do To Move Some Data Around Here?

If you ever find yourself in need of data extraction and loading pipelines, try the solutions already available instead of automatically building your own. Chances are they can solve a lot, if not all, of your concerns. If not, build your own as a last resort. 

And even if existing platforms don’t support everything you need, you should still be able to get most of the way there with a portable and extensible framework. This way, instead of building everything from scratch, you can get 90% of the way there with off-the-shelf features in the platform and only build and maintain the last 10%. The most common example is long-tail integrations: if the platform doesn’t ship with an integration to an API you need, then a good platform will make it easy to write some code or even a no-code solution to build that integration and still get all the useful features offered by the platform. Even if you want the flexibility of just importing a connector as a Python package and triggering it however you like from your code, you can use one of the many open-source EL tools like Airbyte or Singer connectors.
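As one concrete illustration of that "last 10%": Singer connectors are just programs that print SCHEMA, RECORD, and STATE messages as JSON lines, so a minimal long-tail tap can be this small (the stream and fields here are hypothetical), and any Singer-compatible loader can consume its output:

```python
import json
import sys

def emit(message: dict) -> None:
    # Singer taps communicate by writing one JSON message per line to stdout.
    sys.stdout.write(json.dumps(message) + "\n")

def tap(records: list[dict]) -> None:
    emit({"type": "SCHEMA", "stream": "tickets",
          "schema": {"properties": {"id": {"type": "string"}}},
          "key_properties": ["id"]})
    for record in records:
        emit({"type": "RECORD", "stream": "tickets", "record": record})
    # Persist progress so the next run can be incremental.
    emit({"type": "STATE", "value": {"tickets": {"cursor": "2023-08-28"}}})

if __name__ == "__main__":
    tap([{"id": "42"}])
```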

To be clear, data movement is not completely solved. There are situations where existing solutions genuinely fall short, and you need to build novel solutions. However, this is not the majority of the AI engineering population. Most people don’t need to rebuild the same integrations with Jira, Confluence, Slack, Notion, Gmail, Salesforce, etc., over and over again. Let’s just use the solutions that have already been battle-tested and made open for anyone to use so we can get on with adding the value our users actually care about. 


Published at DZone with permission of John Lafleur.

Opinions expressed by DZone contributors are their own.
