#010 Building Robust AI and Data Systems, Data Architecture, Data Quality, Data Storage
In this episode of "How AI is Built", data architect Anjan Banerjee provides an in-depth look at the world of data architecture and building complex AI and data systems. Anjan breaks down the basics using simple analogies, explaining how data architecture involves sorting, cleaning, and painting a picture with data, much like organizing Lego bricks to build a structure.
Summary by Section
Introduction
- Anjan Banerjee, a data architect, discusses building complex AI and data systems
- Explains the basics of data architecture using Lego and chat app examples
Sources and Tools
- Identifying data sources is the first step in designing a data architecture
- Pick the right tools to extract data based on use cases (block storage for images, time series DB, etc.)
- Use one tool for most activities if possible, but specialized tools offer benefits
- Multi-modal storage engines are gaining popularity (Snowflake, Databricks, BigQuery)
Airflow and Orchestration
- Airflow is versatile but has a learning curve; good for orgs with Python/data engineering skills
- For less technical orgs, GUI-based tools like Talend, Alteryx may be better
- AWS Step Functions and managed Airflow are improving native orchestration capabilities
- For multi-cloud, prefer platform-agnostic tools like Astronomer, Prefect, Airbyte
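What any of these orchestrators does at its core is run tasks in dependency order. The sketch below illustrates that idea with Python's standard-library `graphlib`. It is not Airflow's, Prefect's, or Step Functions' API, and the task names (`extract`, `transform`, `load`) and the `run_pipeline` helper are hypothetical. Real orchestrators add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks: dict, dependencies: dict) -> list:
    """Run each task after all of its upstream tasks; return the execution order."""
    # static_order() yields every node only after all of its predecessors.
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


log = []
# Hypothetical tasks: each one just records that it ran.
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
# Maps each task to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order = run_pipeline(tasks, dependencies)
```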
AI and Data Processing
- ML at the point of data acquisition is key for data-intensive use cases, since it avoids storing and processing petabytes in the cloud
- TinyML and edge computing enable ML inference on device (drones, manufacturing)
- Cloud batch processing still dominates for user targeting, recommendations
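The edge-inference idea can be sketched as a device-side filter that forwards only the readings worth sending to the cloud. This is a minimal illustration under stated assumptions: a simple z-score check stands in for a trained TinyML model, and the function name `select_for_upload` and the threshold value are hypothetical.

```python
from statistics import mean, pstdev


def select_for_upload(readings: list[float], z_threshold: float = 2.0) -> list[float]:
    """Return only the readings that deviate strongly from the batch mean."""
    mu = mean(readings)
    sigma = pstdev(readings)
    if sigma == 0:
        # All readings are identical, so nothing stands out.
        return []
    # Stand-in for on-device inference: keep readings with a large z-score.
    return [r for r in readings if abs(r - mu) / sigma > z_threshold]


# Nine normal sensor readings and one outlier.
batch = [10.0] * 9 + [50.0]
to_upload = select_for_upload(batch)
```

Only the selected readings would leave the device; the rest are discarded or summarized locally.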
Data Lakes and Storage
- Storage choice depends on data types, use cases, cloud ecosystem
- Delta Lake excels at data versioning and consistency; Iceberg at partitioning and metadata
- Pulling data into a separate system is often needed for advanced analytics beyond what the source system supports
Data Quality and Standardization
- "Poka-yoke" error-proofing of input screens is vital for downstream data quality
- Impose data quality rules and unified schemas (e.g. UTC timestamps) during ingestion
- Complexity arises with multi-region compliance (GDPR, CCPA) requiring encryption, sanitization
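An ingestion-time check along these lines can be sketched as follows. The schema (`user_id`, `event`, `timestamp`) and the rule of rejecting timestamps without an offset are assumptions made for illustration, not details from the episode.

```python
from datetime import datetime, timezone

# Hypothetical schema: every record must carry these fields.
REQUIRED_FIELDS = ("user_id", "event", "timestamp")


def normalize_record(raw: dict) -> dict:
    """Validate a raw record and return it with a UTC ISO-8601 timestamp."""
    missing = [f for f in REQUIRED_FIELDS if raw.get(f) in (None, "")]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    ts = datetime.fromisoformat(raw["timestamp"])
    if ts.tzinfo is None:
        # Assumed rule: reject naive timestamps rather than guess a time zone.
        raise ValueError("timestamp must carry a UTC offset")

    return {
        "user_id": str(raw["user_id"]).strip(),
        "event": str(raw["event"]).strip().lower(),
        "timestamp": ts.astimezone(timezone.utc).isoformat(),
    }


record = normalize_record(
    {"user_id": 7, "event": " Login ", "timestamp": "2024-05-01T10:00:00+02:00"}
)
```

Rejecting bad records at this point keeps every downstream consumer on one schema and one time zone.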
Hot Takes and Wishes
- Snowflake is overhyped; great UX but costly at scale. Databricks is preferred.
- Automated data set joining and entity resolution across systems would be a game-changer
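To make entity resolution concrete, here is a toy sketch that pairs records from two systems by fuzzy name similarity, using the standard-library `difflib`. The record layout, the `match_entities` helper, and the 0.85 threshold are hypothetical. Production systems would typically use more attributes and blocking strategies than this.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop commas and periods, and collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def match_entities(left: list, right: list, threshold: float = 0.85) -> list:
    """Pair each left record with its best right match at or above the threshold."""
    matches = []
    for l_rec in left:
        best, best_score = None, 0.0
        for r_rec in right:
            score = SequenceMatcher(
                None, normalize(l_rec["name"]), normalize(r_rec["name"])
            ).ratio()
            if score > best_score:
                best, best_score = r_rec, score
        if best is not None and best_score >= threshold:
            matches.append((l_rec["id"], best["id"]))
    return matches


# Hypothetical records from two separate systems.
crm = [{"id": "crm-1", "name": "Acme Corp."}]
erp = [{"id": "erp-9", "name": "ACME Corp"}, {"id": "erp-3", "name": "Globex Inc"}]
pairs = match_entities(crm, erp)
```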
Anjan Banerjee:
Nicolay Gerold:
00:00 Understanding Data Architecture
12:36 Choosing the Right Tools
20:36 The Benefits of Serverless Functions
21:34 Integrating AI in Data Acquisition
24:31 The Trend Towards Single Node Engines
26:51 Choosing the Right Database Management System and Storage
29:45 Adding Additional Storage Components
32:35 Reducing Human Errors for Better Data Quality
39:07 Overhyped and Underutilized Tools
Data architecture, AI, data systems, data sources, data extraction, data storage, multi-modal storage engines, data orchestration, Airflow, edge computing, batch processing, data lakes, Delta Lake, Iceberg, data quality, standardization, poka-yoke, compliance, entity resolution
59 episodes
All episodes
#052 Don't Build Models, Build Systems That Build Models 59:22
#051 Build systems that can be debugged at 4am by tired humans with no context 1:05:51
#050 Bringing LLMs to Production: Delete Frameworks, Avoid Finetuning, Ship Faster 1:06:57
#050 TAKEAWAYS Bringing LLMs to Production: Delete Frameworks, Avoid Finetuning, Ship Faster 11:00
#049 BAML: The Programming Language That Turns LLMs into Predictable Functions 1:02:38
#049 TAKEAWAYS BAML: The Programming Language That Turns LLMs into Predictable Functions 1:12:34
#048 Why Your AI Agents Need Permission to Act, Not Just Read 57:02
#047 Architecting Information for Search, Humans, and Artificial Intelligence 57:21
#046 Building a Search Database From First Principles 53:28
#045 RAG As Two Things - Prompt Engineering and Search 1:02:43
#044 Graphs Aren't Just For Specialists Anymore 1:03:34
#043 Knowledge Graphs Won't Fix Bad Data 1:10:58
#042 Temporal RAG, Embracing Time for Smarter, Reliable Knowledge Graphs 1:33:43
#041 Context Engineering, How Knowledge Graphs Help LLMs Reason 1:33:34
#040 Vector Database Quantization, Product, Binary, and Scalar 52:11
#039 Local-First Search, How to Push Search To End-Devices 53:08
#038 AI-Powered Search, Context Is King, But Your RAG System Ignores Two-Thirds of It 1:14:23
#037 Chunking for RAG: Stop Breaking Your Documents Into Meaningless Pieces 49:12
#036 How AI Can Start Teaching Itself - Synthetic Data Deep Dive 48:10
#035 A Search System That Learns As You Use It (Agentic RAG) 45:29
#034 Rethinking Search Inside Postgres, From Lexemes to BM25 47:15
#033 RAG's Biggest Problems & How to Fix It (ft. Synthetic Data) 51:25
#032 Improving Documentation Quality for RAG Systems 46:36
#031 BM25 As The Workhorse Of Search; Vectors Are Its Visionary Cousin 54:04
#030 Vector Search at Scale, Why One Size Doesn't Fit All 36:25
#029 Search Systems at Scale, Avoiding Local Maxima and Other Engineering Lessons 54:46
#028 Training Multi-Modal AI, Inside the Jina CLIP Embedding Model 49:21
#027 Building the database for AI, Multi-modal AI, Multi-modal Storage 44:53
#026 Embedding Numbers, Categories, Locations, Images, Text, and The World 46:43
#025 Data Models to Remove Ambiguity from AI and Search 58:39
#024 How ColPali is Changing Information Retrieval 54:56
#023 The Power of Rerankers in Modern Search 42:28
#022 The Limits of Embeddings, Out-of-Domain Data, Long Context, Finetuning (and How We're Fixing It) 46:05
#021 The Problems You Will Encounter With RAG At Scale And How To Prevent (or fix) Them 50:08
#020 The Evolution of Search, Finding Search Signals, GenAI Augmented Retrieval 52:15
#019 Data-driven Search Optimization, Analysing Relevance 51:13
#018 Query Understanding: Doing The Work Before The Query Hits The Database 53:01
#017 Unlocking Value from Unstructured Data, Real-World Applications of Generative AI 36:27
#016 Data Processing for AI, Integrating AI into Data Pipelines, Spark 46:25
#015 Building AI Agents for the Enterprise, Agent Cost Controls, Seamless UX 35:11
#014 Building Predictable Agents through Prompting, Compression, and Memory Strategies 32:13
Data Integration and Ingestion for AI & LLMs, Architecting Data Flows | changelog 3 14:52
#013 ETL for LLMs, Integrating and Normalizing Unstructured Data 36:47
#012 Serverless Data Orchestration, AI in the Data Stack, AI Pipelines 28:05
#011 Mastering Vector Databases, Product & Binary Quantization, Multi-Vector Search 40:05
#010 Building Robust AI and Data Systems, Data Architecture, Data Quality, Data Storage 45:32
#009 Modern Data Infrastructure for Analytics and AI, Lakehouses, Open Source Data Stack 27:52
#008 Knowledge Graphs for Better RAG, Virtual Entities, Hybrid Data Models 36:39
#007 Navigating the Modern Data Stack, Choosing the Right OSS Tools, From Problem to Requirements to Architecture 38:11
#006 Data Orchestration Tools, Choosing the right one for your needs 32:36
#005 Building Reliable LLM Applications, Production-Ready RAG, Data-Driven Evals 29:39
Lance v2: Rethinking Columnar Storage for Faster Lookups, Nulls, and Flexible Encodings | changelog 2 21:32
#004 AI with Supabase, Postgres Configuration, Real-Time Processing, and more 31:56
#003 AI Inside Your Database, Real-Time AI, Declarative ML/AI 36:03
Supabase acquires OrioleDB, A New Database Engine for PostgreSQL | changelog 1 13:36
#002 AI Powered Data Transformation, Combining gen & trad AI, Semantic Validation 37:08
#001 Multimodal AI, Storing 1 Billion Vectors, Building Data Infrastructure at LanceDB 34:03