Artwork

תוכן מסופק על ידי Tobias Macey. כל תוכן הפודקאסטים כולל פרקים, גרפיקה ותיאורי פודקאסטים מועלים ומסופקים ישירות על ידי Tobias Macey או שותף פלטפורמת הפודקאסט שלהם. אם אתה מאמין שמישהו משתמש ביצירה שלך המוגנת בזכויות יוצרים ללא רשותך, אתה יכול לעקוב אחר התהליך המתואר כאן https://he.player.fm/legal.
Player FM - אפליקציית פודקאסט
התחל במצב לא מקוון עם האפליקציה Player FM !

Unleash The Power Of Dataframes At Any Scale With Modin

38:54
 
שתפו
 

Fetch error

Hmmm there seems to be a problem fetching this series right now. Last successful fetch was on July 07, 2024 06:12 (1+ y ago)

What now? This series will be checked again in the next day. If you believe it should be working, please verify the publisher's feed link below is valid and includes actual episode links. You can contact support to request the feed be immediately fetched.

Manage episode 298031968 series 1297742
תוכן מסופק על ידי Tobias Macey. כל תוכן הפודקאסטים כולל פרקים, גרפיקה ותיאורי פודקאסטים מועלים ומסופקים ישירות על ידי Tobias Macey או שותף פלטפורמת הפודקאסט שלהם. אם אתה מאמין שמישהו משתמש ביצירה שלך המוגנת בזכויות יוצרים ללא רשותך, אתה יכול לעקוב אחר התהליך המתואר כאן https://he.player.fm/legal.

Summary

When you start working on a data project there are always a variety of unknown factors that you have to explore. One of those is the volume of total data that you will eventually need to handle, and the speed and scale at which it will need to be processed. If you optimize for scale too early then it adds a high barrier to entry due to the complexities of distributed systems, but if you invest in a lot of engineering up front then it can be challenging to refactor for scale. Modin is a project that aims to remove that decision by letting you seamlessly replace your existing Pandas code and scale across CPU cores or across a cluster of machines. In this episode Devin Petersohn explains why he started working on solving this problem, how Modin is architected to allow for a smooth escalation from small to large volumes of data and compute, and how you can start using it today to accelerate your Pandas workflows.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Your host as usual is Tobias Macey and today I’m interviewing Devin Petersohn about Modin, a Pandas compatible dataframe library for datasets from 1MB to 1TB+

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you describe what Modin is and the story behind it?
    • Why study dataframes?
  • How do dataframes compare to databases?
    • What can you do in a dataframe that you couldn’t in a database?
  • What are your overall goals for the Modin project?
  • Who are the target users of Modin and how does that influence your prioritization of features?
  • What are some of the API inconsistencies that you have had to abstract and work around between Pandas, Ray, and Dask to give users a seamless experience?
  • What are some of the considerations in terms of capabilities or user experience that will influence whether to use Ray or Dask as the execution engine?
  • Can you describe how Modin is implemented?
    • How has the constraint of replicating the Pandas API influenced your architectural choices?
    • What are the most complex or challenging Pandas APIs to replicate in Modin?
  • In addition to the core Pandas API you have also added experimental features such as SQL support and a spreadsheet interface. How have those capabilities affected the range of potential use cases and end users?
  • What are some of the complexities that come from acting as a middleware between the Pandas API and the Ray and Dask frameworks?
  • What are some of the initial ideas or assumptions that you had about the design or utility of Modin that have been challenged as you worked through building and releasing it?
  • What are the most interesting, innovative, or unexpected ways that you have seen Modin used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Modin?
  • When is Modin the wrong choice?
  • What do you have planned for the future of Modin?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

  continue reading

394 פרקים

Artwork
iconשתפו
 

Fetch error

Hmmm there seems to be a problem fetching this series right now. Last successful fetch was on July 07, 2024 06:12 (1+ y ago)

What now? This series will be checked again in the next day. If you believe it should be working, please verify the publisher's feed link below is valid and includes actual episode links. You can contact support to request the feed be immediately fetched.

Manage episode 298031968 series 1297742
תוכן מסופק על ידי Tobias Macey. כל תוכן הפודקאסטים כולל פרקים, גרפיקה ותיאורי פודקאסטים מועלים ומסופקים ישירות על ידי Tobias Macey או שותף פלטפורמת הפודקאסט שלהם. אם אתה מאמין שמישהו משתמש ביצירה שלך המוגנת בזכויות יוצרים ללא רשותך, אתה יכול לעקוב אחר התהליך המתואר כאן https://he.player.fm/legal.

Summary

When you start working on a data project there are always a variety of unknown factors that you have to explore. One of those is the volume of total data that you will eventually need to handle, and the speed and scale at which it will need to be processed. If you optimize for scale too early then it adds a high barrier to entry due to the complexities of distributed systems, but if you invest in a lot of engineering up front then it can be challenging to refactor for scale. Modin is a project that aims to remove that decision by letting you seamlessly replace your existing Pandas code and scale across CPU cores or across a cluster of machines. In this episode Devin Petersohn explains why he started working on solving this problem, how Modin is architected to allow for a smooth escalation from small to large volumes of data and compute, and how you can start using it today to accelerate your Pandas workflows.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Your host as usual is Tobias Macey and today I’m interviewing Devin Petersohn about Modin, a Pandas compatible dataframe library for datasets from 1MB to 1TB+

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you describe what Modin is and the story behind it?
    • Why study dataframes?
  • How do dataframes compare to databases?
    • What can you do in a dataframe that you couldn’t in a database?
  • What are your overall goals for the Modin project?
  • Who are the target users of Modin and how does that influence your prioritization of features?
  • What are some of the API inconsistencies that you have had to abstract and work around between Pandas, Ray, and Dask to give users a seamless experience?
  • What are some of the considerations in terms of capabilities or user experience that will influence whether to use Ray or Dask as the execution engine?
  • Can you describe how Modin is implemented?
    • How has the constraint of replicating the Pandas API influenced your architectural choices?
    • What are the most complex or challenging Pandas APIs to replicate in Modin?
  • In addition to the core Pandas API you have also added experimental features such as SQL support and a spreadsheet interface. How have those capabilities affected the range of potential use cases and end users?
  • What are some of the complexities that come from acting as a middleware between the Pandas API and the Ray and Dask frameworks?
  • What are some of the initial ideas or assumptions that you had about the design or utility of Modin that have been challenged as you worked through building and releasing it?
  • What are the most interesting, innovative, or unexpected ways that you have seen Modin used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Modin?
  • When is Modin the wrong choice?
  • What do you have planned for the future of Modin?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

  continue reading

394 פרקים

כל הפרקים

×
 
Loading …

ברוכים הבאים אל Player FM!

Player FM סורק את האינטרנט עבור פודקאסטים באיכות גבוהה בשבילכם כדי שתהנו מהם כרגע. זה יישום הפודקאסט הטוב ביותר והוא עובד על אנדרואיד, iPhone ואינטרנט. הירשמו לסנכרון מנויים במכשירים שונים.

 

מדריך עזר מהיר

האזן לתוכנית הזו בזמן שאתה חוקר
הפעלה