Added four years ago
Content provided by Utsav Shah. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Utsav Shah or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://he.player.fm/legal.

Software at Scale 54 - Community Trust with Vikas Agarwal

40:48
 
 


Vikas Agarwal is an engineering leader with over twenty years of experience leading engineering teams. We focused this episode on his experience as the Head of Community Trust at Amazon, where he dealt with the various challenges of fake reviews on Amazon products.

Apple Podcasts | Spotify | Google Podcasts

Highlights (GPT-3 generated)

[0:00:17] Vikas Agarwal's origin story.

[0:00:52] How Vikas learned to code.

[0:03:24] Vikas's first job out of college.

[0:04:30] Vikas's experience with the review business and community trust.

[0:06:10] Mission of the community trust team.

[0:07:14] How to start off with a problem.

[0:09:30] Different flavors of review abuse.

[0:10:15] The program for gift cards and fake reviews.

[0:12:10] Google search and FinTech.

[0:14:00] Fraud and ML models.

[0:15:51] Other things to consider when it comes to trust.

[0:17:42] Ryan Reynolds' funny review on his product.

[0:18:10] Reddit-like problems.

[0:21:03] Activism filters.

[0:23:03] Elon Musk's changing policy.

[0:23:59] False positives and appeals process.

[0:28:29] Stress levels and question mark emails from Jeff Bezos.

[0:30:32] Jeff Bezos' mathematical skills.

[0:31:45] Amazon's closed loop auditing process.

[0:32:24] Amazon's success and leadership principles.

[0:33:35] Operationalizing appeals at scale.

[0:35:45] Data science, metrics, and hackathons.

[0:37:14] Developer experience and iterating changes.

[0:37:52] Advice for tackling a problem of this scale.

[0:39:19] Striving for trust and external validation.

[0:40:01] Amazon's efforts to combat abuse.

[0:40:32] Conclusion.


This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev


60 episodes

All episodes

 
Aravind was a Staff Software Engineer at Uber, and currently works at OpenAI. Apple Podcasts | Spotify | Google Podcasts

Edited Transcript

Can you tell us about the scale of data Uber was dealing with when you joined in 2018, and how it evolved?

When I joined Uber in mid-2018, we were handling a few petabytes of data. The company was going through a significant scaling journey, both in terms of launching in new cities and the corresponding increase in data volume. By the time I left, our data had grown to over an exabyte. To put it in perspective, the amount of data grew by a factor of about 20 in just a three to four-year period. Currently, Uber ingests roughly a petabyte of data daily. This includes some replication, but it's still an enormous amount. About 60-70% of this is raw data, coming directly from online systems or message buses. The rest is derived data sets and model data sets built on top of the raw data.

That's an incredible amount of data. What kinds of insights and decisions does this enable for Uber?

This scale of data enables a wide range of complex analytics and data-driven decisions. For instance, we can analyze how many concurrent trips we're handling throughout the year globally. This is crucial for determining how many workers and CPUs we need running at any given time to serve trips worldwide. We can also identify trends like the fastest growing cities or seasonal patterns in traffic. The vast amount of historical data allows us to make more accurate predictions and spot long-term trends that might not be visible in shorter time frames. Another key use is identifying anomalous user patterns. For example, we can detect potentially fraudulent activities like a single user account logging in from multiple locations across the globe. We can also analyze user behavior patterns, such as which cities have higher rates of trip cancellations compared to completed trips. These insights don't just inform day-to-day operations; they can lead to key product decisions. For instance, by plotting heat maps of trip coordinates over a year, we could see overlapping patterns that eventually led to the concept of Uber Pool.

How does Uber manage real-time versus batch data processing, and what are the trade-offs?

We use both offline (batch) and online (real-time) data processing systems, each optimized for different use cases. For real-time analytics, we use tools like Apache Pinot. These systems are optimized for low latency and quick response times, which is crucial for certain applications. For example, our restaurant manager system uses Pinot to provide near-real-time insights. Data flows from the serving stack to Kafka, then to Pinot, where it can be queried quickly. This allows for rapid decision-making based on very recent data. On the other hand, our offline flow uses the Hadoop stack for batch processing. This is where we store and process the bulk of our historical data. It's optimized for throughput – processing large amounts of data over time. The trade-off is that real-time systems are generally 10 to 100 times more expensive than batch systems. They require careful tuning of indexes and partitioning to work efficiently. However, they enable us to answer queries in milliseconds or seconds, whereas batch jobs might take minutes or hours. The choice between batch and real-time depends on the specific use case. We always ask ourselves: Does this really need to be real-time, or can it be done in batch? The answer to this question goes a long way in deciding which approach to use and in building maintainable systems.

What challenges come with maintaining such large-scale data systems, especially as they mature?

As data systems mature, we face a range of challenges beyond just handling the growing volume of data. One major challenge is the need for additional tools and systems to manage the complexity. For instance, we needed to build tools for data discovery. When you have thousands of tables and hundreds of users, you need a way for people to find the right data for their needs. We built a tool called Databook at Uber to solve this problem. Governance and compliance are also huge challenges. When you're dealing with sensitive customer data, you need robust systems to enforce data retention policies and handle data deletion requests. This is particularly challenging in a distributed system where data might be replicated across multiple tables and derived data sets. We built an in-house lineage system to track which workloads derive from what data. This is crucial for tasks like deleting specific data across the entire system. It's not just about deleting from one table – you need to track down and update all derived data sets as well. Data deletion itself is a complex process. Because most files in the batch world are kept immutable for efficiency, deleting data often means rewriting entire files. We have to batch these operations and perform them carefully to maintain system performance. Cost optimization is an ongoing challenge. We're constantly looking for ways to make our systems more efficient, whether that's by optimizing our storage formats, improving our query performance, or finding better ways to manage our compute resources.

How do you see the future of data infrastructure evolving, especially with recent AI advancements?

The rise of AI and particularly generative AI is opening up new dimensions in data infrastructure. One area we're seeing a lot of activity in is vector databases and semantic search capabilities. Traditional keyword-based search is being supplemented or replaced by embedding-based semantic search, which requires new types of databases and indexing strategies. We're also seeing increased demand for real-time processing. As AI models become more integrated into production systems, there's a need to handle more GPUs in the serving flow, which presents its own set of challenges. Another interesting trend is the convergence of traditional data analytics with AI workloads. We're starting to see use cases where people want to perform complex queries that involve both structured data analytics and AI model inference. Overall, I think we're moving towards more integrated, real-time, and AI-aware data infrastructure. The challenge will be balancing the need for advanced capabilities with concerns around cost, efficiency, and maintainability. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
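The lineage-driven deletion work described above can be pictured with a small sketch. This is a minimal illustration in Python, not Uber's in-house lineage system; the table names and edges are hypothetical.

from collections import deque

# Hypothetical lineage edges: each table maps to the data sets derived
# from it. Uber's real lineage system is in-house; this only shows the idea.
LINEAGE = {
    "raw.trips": ["derived.trips_daily", "ml.eta_features"],
    "derived.trips_daily": ["derived.city_growth"],
    "ml.eta_features": [],
    "derived.city_growth": [],
}

def affected_datasets(root: str) -> list[str]:
    """Transitively collect every data set that must be rewritten when
    records are deleted from root. Batch files are immutable, so a
    deletion means rewriting files in each downstream data set too."""
    seen, queue = set(), deque([root])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

# A deletion request against raw.trips fans out to all derived data:
print(affected_datasets("raw.trips"))
# ['derived.city_growth', 'derived.trips_daily', 'ml.eta_features']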
 
Nora is the CEO and co-founder of Jeli, an incident management platform. Apple Podcasts | Spotify | Google Podcasts Nora provides an in-depth look into incident management within the software industry and discusses the incident management platform Jeli. Nora's fascination with risk and its influence on human behavior stems from her early career in hardware and her involvement with a home security company. These experiences revealed the high stakes associated with software failures, uncovering the importance of learning from incidents and fostering a blame-aware culture that prioritizes continuous improvement. In contrast to the traditional blameless approach, which seeks to eliminate blame entirely, a blame-aware culture acknowledges that mistakes happen and focuses on learning from them instead of assigning blame. This approach encourages open discussions about incidents, creating a sense of safety and driving superior long-term outcomes. We also discuss chaos engineering - the practice of deliberately creating turbulent conditions in production to simulate real-world scenarios. This approach allows teams to experiment and acquire the necessary skills to effectively respond to incidents. Nora then introduces Jeli, an incident management platform that places a high priority on the human aspects of incidents. Unlike other platforms that solely concentrate on technology, Jeli aims to bridge the gap between technology and people. By emphasizing coordination, communication, and learning, Jeli helps organizations reduce incident costs and cultivate a healthier incident management culture. We discuss how customer expectations in the software industry have evolved over time, with users becoming increasingly intolerant of low reliability, particularly in critical services (Dan Luu has an incredible blog on the incidence of bugs in day-to-day software). This shift in priorities has compelled organizations to place greater importance on reliability and invest in incident management practices. We conclude by discussing how incident management will further evolve and how leaders can set their organizations up for success. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Abi Noda is the CEO and co-founder of DX, a developer productivity platform. Apple Podcasts | Spotify | Google Podcasts My view on developer experience and productivity measurement aligns extremely closely with DX’s view. The productivity of a group of engineers cannot be measured by tools alone - there are too many qualitative factors, like cross-functional stakeholder bureaucracy or inefficiency and inherent domain/codebase complexity, that cannot be measured by tools. At the same time, there are some metrics, like whether an engineer has committed any code changes in their first week/month, that serve as useful guardrails for engineering leadership. A combination of tools and metrics may provide a holistic view of, and insights into, the engineering organization’s throughput. In this episode, we discuss the DX platform, and Abi’s recently published research paper on developer experience. We talk about how organizations can use tools and surveys to iterate and improve upon developer experience, and ultimately, engineering throughput. GPT-4 generated summary In this episode, Abi Noda and I explore the landscape of engineering metrics and a quantifiable approach towards developer experience. Our discussion goes from the value of developer surveys and system-based metrics to the tangible ways in which DX is innovating the field. We initiate our conversation with a comparison of developer surveys and system-based metrics. Abi explains that while developer surveys offer a qualitative perspective on tool efficacy and user sentiment, system-based metrics present a quantitative analysis of productivity and code quality. The discussion then moves to the real-world applications of these metrics, with Pfizer and eBay as case studies. Pfizer, for example, uses a model where they employ metrics for a detailed understanding of developer needs, subsequently driving strategic decision-making processes. They have used these metrics to identify bottlenecks in their development cycle, and strategically address these pain points. eBay, on the other hand, uses the insights from developer sentiment surveys to design tools that directly enhance developer satisfaction and productivity. Next, our dialogue around survey development centers on the dilemma between standardization and customization. While standardization offers cost efficiency and benchmarking opportunities, customization acknowledges the unique nature of every organization. Abi proposes a blend of both to cater to different aspects of developer sentiment and productivity metrics. The highlight of the conversation was the introduction of DX's innovative data platform. The platform consolidates data across internal and third-party tools in a ready-to-analyze format, giving users the freedom to build their queries, reports, and metrics. The ability to combine survey and system data allows the unearthing of unique insights, marking a distinctive advantage of DX's approach. In this episode, Abi Noda shares enlightening perspectives on engineering metrics and the role they play in shaping the developer experience. We delve into how DX's unique approach to data aggregation and its potential applications can lead organizations toward more data-driven and effective decision-making processes. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
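As a rough illustration of the surveys-plus-guardrails idea above, here is a sketch that blends a survey sentiment score with the first-week-commit guardrail mentioned in the episode. The metric names, weights, and thresholds are invented for illustration; this is not DX's scoring model.

surveys = {"payments": 3.9, "search": 4.4}  # average tool sentiment, 1-5 scale
first_week_commit = {"payments": 0.65, "search": 0.92}  # share of new hires
                                                        # committing code in week one

def team_report(team: str) -> str:
    """Combine a qualitative signal (survey) with a system guardrail."""
    flags = []
    if surveys[team] < 4.0:
        flags.append("low tool sentiment: follow up qualitatively")
    if first_week_commit[team] < 0.8:
        flags.append("slow onboarding: check dev-env setup")
    return f"{team}: {'; '.join(flags) or 'healthy'}"

for team in surveys:
    print(team_report(team))
# payments: low tool sentiment: follow up qualitatively; slow onboarding: check dev-env setup
# search: healthy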
 
Robert Cooke is the CTO and co-founder of 3Forge, a real-time data visualization platform. Apple Podcasts | Spotify | Google Podcasts In this episode, we delve into Wall Street's high-frequency trading evolution and the importance of high-volume trading data observability. We examine traditional software observability tools, such as Datadog, and contrast them with 3Forge’s financial observability platform, AMI. GPT-4 generated summary In this episode of the Software at Scale podcast, Robert Cooke, CTO and Co-founder of 3Forge, a comprehensive internal tools platform, shares his journey and insights. He outlines his career trajectory, which includes prominent positions such as the Infrastructure Lead at Bear Stearns and the Head of Infrastructure at Liquidnet, and his work on high-frequency trading systems that employ software and hardware to perform rapid, automated trading decisions based on market data. Cooke elucidates how 3Forge empowers subject matter experts to automate trading decisions by encoding business logic. He underscores the criticality of robust monitoring systems around these automated trading systems, drawing an analogy with nuclear reactors due to the potential catastrophic repercussions of any malfunction. The dialogue then shifts to the impact of significant events like the COVID-19 pandemic on high-frequency trading systems. Cooke postulates that these systems can falter under such conditions, as they are designed to follow developer-encoded instructions and lack the flexibility to adjust to unforeseen macro events. He refers to past instances like the Facebook IPO and Knight Capital's downfall, where automated trading systems were unable to handle atypical market conditions, highlighting the necessity for human intervention in such scenarios. Cooke then delves into how 3Forge designs software for mission-critical scenarios, making an analogy with military strategy. Utilizing the OODA loop concept - Observe, Orient, Decide, and Act - they can swiftly respond to situations like outages. He argues that traditional observability tools only address the first step, whereas their solution facilitates quick orientation and decision-making, substantially reducing reaction time. He cites a scenario involving a sudden surge in Facebook orders where their tool allows operators to detect the problem in real time, comprehend the context, decide on the response, and promptly act on it. He extends this example to situations like government incidents or emergencies where an expedited response is paramount. Additionally, Cooke emphasizes the significance of low latency UI updates in their tool. He explains that their software uses an online programming approach, reacting to changes in real-time and only updating the altered components. As data size increases and reaction time becomes more critical, this feature becomes increasingly important. Cooke concludes this segment by discussing the evolution of their clients' use cases, from initially needing static data overviews to progressively demanding real-time information and interactive workflows. He gives the example of users being able to comment on a chart and that comment being immediately visible to others, akin to the real-time collaboration features in tools like Google Docs. In the subsequent segment, Cooke shares his perspective on choosing the right technology to drive business decisions.
He stresses the importance of understanding the history and trends of technology, having experienced several shifts in the tech industry since his early software writing days in the 1980s. He projects that while computer speeds might plateau, parallel computing will proliferate, leading to CPUs with more cores. He also predicts continued growth in memory, both in terms of RAM and disk space. He further elucidates his preference for web-based applications due to their security and absence of installation requirements. He underscores the necessity of minimizing the data in the web browser and shares how they have built every component from scratch to achieve this. Their components are designed to handle as much data as possible, constantly pulling in data based on user interaction. He also emphasizes the importance of constructing a high-performing component library that integrates seamlessly with different components, providing a consistent user experience. He asserts that developers often face confusion when required to amalgamate different components since these components tend to behave differently. He envisions a future where software development involves no JavaScript or HTML, a concept that he acknowledges may be unsettling to some developers. Using the example of a dropdown menu, Cooke explains how a component initially designed for a small amount of data might eventually need to handle much larger data sets. He emphasizes the need to design components to handle the maximum possible data from the outset to avoid such issues. The conversation then pivots to the concept of over-engineering. Cooke argues that building a robust and universal solution from the start is not over-engineering but an efficient approach. He notes the significant overlap in applications use cases, making it advantageous to create a component that can cater to a wide variety of needs. In response to the host's query about selling software to Wall Street, Cooke advocates targeting the most demanding customers first. He believes that if a product can satisfy such customers, it's easier to sell to others. They argue that it's challenging to start with a simple product and then scale it up for more complex use cases, but it's feasible to start with a complex product and tailor it for simpler use cases. Cooke further describes their process of creating a software product. Their strategy was to focus on core components, striving to make them as efficient and effective as possible. This involved investing years on foundational elements like string libraries and data marshalling. After establishing a robust foundation, they could then layer on additional features and enhancements. This approach allowed them to produce a mature and capable product eventually. They also underscore the inevitability of users pushing software to its limits, regardless of its optimization. Thus, they argue for creating software that is as fast as possible right from the start. They refer to an interview with Steve Jobs, who argued that the best developers can create software that's substantially faster than others. Cooke's team continually seeks ways to refine and improve the efficiency of their platform. Next, the discussion shifts to team composition and the necessary attributes for software engineers. Cooke emphasizes the importance of a strong work ethic and a passion for crafting good software. 
He explains how his ambition to become the best software developer from a young age has shaped his company's culture, fostering a virtuous cycle of hard work and dedication among his team. The host then emphasizes the importance of engineers working on high-quality products, suggesting that problems and bugs can sap energy and demotivate a team. Cooke concurs, comparing the experience of working on high-quality software to working on an F1 race car, and how the pursuit of refinement and optimization is a dream for engineers. The conversation then turns to the importance of having a team with diverse thought processes and skillsets. Cooke recounts how the introduction of different disciplines and perspectives in 2019 profoundly transformed his company. The dialogue then transitions to the state of software solutions before the introduction of their high-quality software, touching upon the compartmentalized nature of systems in large corporations and the problems that arise from it. Cooke explains how their solution offers a more comprehensive and holistic overview that cuts across different risk categories. Finally, in response to the host's question about open-source systems, Cooke expresses reservations about the use of open-source software in a corporate setting. However, he acknowledges the extensive overlap and redundancy among the many new systems being developed. Although he does not identify any specific groundbreaking technology, he believes the rapid proliferation of similar technologies might lead to considerable technical debt in the future. Host Utsav wraps up the conversation by asking Cooke about his expectations and concerns for the future of technology and the industry. Cooke voices his concern about the continually growing number of different systems and technologies that companies are adopting, which makes integrating and orchestrating all these components a challenge. He advises companies to exercise caution when adopting multiple technologies simultaneously. However, Cooke also expresses enthusiasm about the future of 3Forge, a platform he has devoted a decade of his life to developing. He expresses confidence in the unique approach and discipline employed in building the platform. Cooke is optimistic about the company's growth and marketing efforts and their focus on fostering a developer community. He believes that the platform will thrive as developers share their experiences, and the product gains momentum. Utsav acknowledges the excitement and potential challenges that lie ahead, especially in managing community-driven systems. They conclude the conversation by inviting Cooke to return for another discussion in the future to review the progression and evolution of the topic. Both express their appreciation for the fruitful discussion before ending the podcast. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Roi Rav-Hon is the co-founder and CEO of Finout, a SaaS cost management platform. Apple Podcasts | Spotify | Google Podcasts In this episode, we review the challenge of maintaining reasonable SaaS costs for tech companies. Usage-based pricing models for infrastructure lead to a gradual ramp-up of costs, and keeping those costs in check has always sneakily come up as a priority in my career as an infrastructure/platform engineer. So I’m particularly interested in how engineering teams can better understand, track, and “shift left” infrastructure cost tracking and prevent regressions. We specifically go over Kubernetes cost management, and why costs need to be attributed to specific teams in order for cost management to be self-governing in an organization. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
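To make the attribution point concrete, here is a toy sketch of rolling shared Kubernetes spend up to owning teams. The billing records and field names are hypothetical, not Finout's data model; it only illustrates why per-team attribution makes a cost regression visible to the team that caused it.

from collections import defaultdict

# Hypothetical hourly cost records exported from a cloud billing feed.
records = [
    {"namespace": "checkout", "team": "payments", "cpu_cost": 3.20, "mem_cost": 1.10},
    {"namespace": "search", "team": "discovery", "cpu_cost": 6.10, "mem_cost": 2.40},
    {"namespace": "checkout", "team": "payments", "cpu_cost": 2.90, "mem_cost": 0.95},
]

def cost_by_team(rows):
    """Roll pod-level costs up to the team owning each namespace, so a
    regression lands on a specific team's dashboard, not a shared bill."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["team"]] += row["cpu_cost"] + row["mem_cost"]
    return dict(totals)

print(cost_by_team(records))  # {'payments': 8.15, 'discovery': 8.5}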
 
Ben Ofiri is the CEO and Co-Founder of Komodor, a Kubernetes troubleshooting platform. Apple Podcasts | Spotify | Google Podcasts We had an episode with the other founder of Komodor, Itiel, in 2021, and I thought it would be fun to revisit the topic. Highlights (ChatGPT Generated) [0:00] Introduction to the Software At Scale podcast and the guest speaker, Ben Ofiri, CEO and co-founder of Komodor. - Discussion of why Ben decided to work on a Kubernetes platform and the potential impact of Kubernetes becoming the standard for managing microservices. - Reasons why companies are interested in adopting Kubernetes, including the ability to scale quickly and cost-effectively, and the enterprise-ready features it offers. - The different ways companies migrate to Kubernetes, either starting from a small team and gradually increasing usage, or a strategic decision from the top down. - The flexibility of Kubernetes is its strength, but it also comes with complexity that can lead to increased time spent on alerts and managing incidents. - The learning curve for developers to be able to efficiently troubleshoot and operate Kubernetes can be steep and is a concern for many organizations. [8:17] Tools for Managing Kubernetes. - The challenges that arise when trying to operate and manage Kubernetes. - DevOps and SRE teams become the bottleneck due to their expertise in managing Kubernetes, leading to frustration for other teams. - A report by a cloud native observability organization found that one out of five developers felt frustrated enough to want to quit their job due to friction between different teams. - Ben's idea for Komodor was to take the knowledge and expertise of the DevOps and SRE teams and democratize it to the entire organization. - The platform simplifies the operation, management, and troubleshooting aspects of Kubernetes for every engineer in the company, from junior developers to the head of engineering. - One of the most frustrating issues for customers is identifying which teams should care about which issues in Kubernetes, which Komodor helps solve with automated checks and reports that indicate whether the problem is an infrastructure or application issue, among other things. - Komodor provides suggestions for actions to take but leaves the decision-making and responsibility for taking the action to the users. - The platform allows users to track how many times they take an action and how useful it is, allowing for optimization over time. [12:03] The Challenge of Balancing Standardization and Flexibility. - Kubernetes provides a lot of flexibility, but this can lead to fragmented infrastructure and inconsistent usage patterns. - Komodor aims to strike a balance between standardization and flexibility, allowing for best practices and guidelines to be established while still allowing for customization and unique needs. [16:14] Using Data to Improve Kubernetes Management. - The platform tracks user actions and the effectiveness of those actions to make suggestions and fine-tune recommendations over time. - The goal is to build a machine that knows what actions to take for almost all scenarios in Kubernetes, providing maximum benefit to customers. [20:40] Why Kubernetes Doesn't Include All Management Functionality. - Kubernetes is an open-source project with many different directions it can go in terms of adding functionality. - Reliability, observability, and operational functionality are typically provided by vendors or cloud providers and not organically from the Kubernetes community. - Different players in the ecosystem contribute different pieces to create a comprehensive experience for the end user. [25:05] Keeping Up with Kubernetes Development and Adoption. - How Komodor keeps up with Kubernetes development and adoption. - The team is data-driven and closely tracks user feedback and needs, as well as new developments and changes in the ecosystem. - The use and adoption of custom resources is a constantly evolving and rapidly changing area, requiring quick research and translation into product specs. - The company hires deeply technical people, including those with backgrounds in DevOps and SRE, to ensure a deep understanding of the complex problem they are trying to solve. [32:12] The Effects of the Economy on Komodor. - How the shift in the economy has affected Komodor. - Companies must be more cost-efficient, leading to increased interest in Kubernetes and tools like Komodor. - The pandemic has also highlighted the need for remote work and cloud-based infrastructure, further fueling demand. - Komodor has seen growth as a result of these factors and believes it is well-positioned for continued success. [36:17] The Future of Kubernetes and Komodor. - Kubernetes will continue to evolve and be adopted more widely by organizations of all sizes and industries. - The team is excited about the potential of rule engines and other tools to improve management and automation within Kubernetes. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
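The infrastructure-versus-application triage mentioned in the highlights can be sketched in a few lines against the Kubernetes API. This is a toy classifier using the official kubernetes Python client, not Komodor's implementation; the reason lists are illustrative, not exhaustive.

from kubernetes import client, config

INFRA_REASONS = {"ImagePullBackOff", "ErrImagePull", "CreateContainerConfigError"}
APP_REASONS = {"CrashLoopBackOff", "RunContainerError"}

def triage(namespace: str) -> dict[str, str]:
    """Label unhealthy pods as infrastructure/config or application issues."""
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    verdicts = {}
    for pod in v1.list_namespaced_pod(namespace).items:
        for cs in pod.status.container_statuses or []:
            waiting = cs.state.waiting
            if waiting is None:
                continue
            if waiting.reason in INFRA_REASONS:
                verdicts[pod.metadata.name] = "infrastructure/config issue"
            elif waiting.reason in APP_REASONS:
                verdicts[pod.metadata.name] = "application issue"
    return verdicts

print(triage("production"))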
 
 
Mike Bland is a software instigator - he helped drive adoption of automated testing at Google, and the Quality Culture Initiative at Apple. Apple Podcasts | Spotify | Google Podcasts Mike’s blog was instrumental in my decision to pick a job in developer productivity/platform engineering. We talk about the Rainbow of Death - the idea of driving cultural change in large engineering organizations - one of the key challenges of platform engineering teams. And we deep dive into the value of automated testing and the common pushbacks against it. Highlights (GPT-3 generated) [0:00 - 0:29] Welcome [0:29 - 0:38] Explanation of Rainbow of Death [0:38 - 0:52] Story of Testing Grouplet at Google [0:52 - 5:52] Benefits of Writing Blogs and Engineering Culture Change [5:52 - 6:48] Impact of Mike's Blog [6:48 - 7:45] Automated Testing at Scale [7:45 - 8:10] "I'm a Snowflake" Mentality [8:10 - 8:59] Instigator Theory and Crossing the Chasm Model [8:59 - 9:55] Discussion of Dependency Injection and Functional Decomposition [9:55 - 16:19] Discussion of Testing and Testable Code [16:19 - 24:30] Impact of Organizational and Cultural Change on Writing Tests [24:30 - 26:04] Instigator Theory [26:04 - 32:47] Strategies for Leaders to Foster and Support Testing [32:47 - 38:50] Role of Leadership in Promoting Testing [38:50 - 43:29] Philosophical Implications of Testing Practices This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
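Since the episode dwells on dependency injection and testable code, a minimal example of the seam it creates may help. This sketch is not from the episode; it just shows a test controlling time because the clock is injected rather than global.

import datetime
import unittest

def greeting(clock=datetime.datetime.now) -> str:
    # The clock is an injected dependency, so tests can control it
    # instead of patching global state.
    return "good morning" if clock().hour < 12 else "good afternoon"

class GreetingTest(unittest.TestCase):
    def test_morning(self):
        self.assertEqual(greeting(lambda: datetime.datetime(2023, 1, 1, 9)), "good morning")

    def test_afternoon(self):
        self.assertEqual(greeting(lambda: datetime.datetime(2023, 1, 1, 15)), "good afternoon")

if __name__ == "__main__":
    unittest.main()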
 
Benjy Weinberger is the co-founder of Toolchain, a build tool platform. He is one of the creators of the original Pants, an in-house Twitter build system focused on Scala, and was the VP of Infrastructure at Foursquare. Toolchain now focuses on Pants 2, a revamped build system. Apple Podcasts | Spotify | Google Podcasts In this episode, we go back to the basics and discuss the technical details of scalable build systems like Pants, Bazel, and Buck. A common challenge with these build systems is that it is extremely hard to migrate to them and have them interoperate with open source tools that are built differently. Benjy’s team redesigned Pants with an initial hyper-focus on Python to fix these shortcomings, in an attempt to create a third generation of build tools: one that easily interoperates with differently built packages, but is still fast and scalable. Machine-generated Transcript [0:00] Hey, welcome to another episode of the Software at Scale podcast. Joining me here today is Benjy Weinberger, previously a software engineer at Google and Twitter, VP of Infrastructure at Foursquare, and now the founder and CEO of Toolchain. Thank you for joining us. Thanks for having me. It's great to be here. Yes. Right from the beginning, I saw that you worked at Google in 2002, which is forever ago, like 20 years ago at this point. What was that experience like? What kind of change did you see as you worked there for a few years? [0:37] As you can imagine, it was absolutely fascinating. And I should mention that while I was at Google from 2002, that was not my first job. I have been a software engineer for over 25 years, and so there were five years before that where I worked at a couple of companies. One was (I was living in Israel at the time) my first job out of college, at Check Point, which was a big, successful network security company. And then I worked for a small startup. And then I moved to California and started working at Google. And so I had the experience that I think many people had in those days, and many people still do: the work you're doing is fascinating, but the tools you're given to do it with as a software engineer are not great. I'd had five years of experience of struggling with builds being slow, builds being flaky, with everything requiring a lot of effort. There was almost a hazing, ritual quality to it. Like, this is what makes you a great software engineer: struggling through the mud and through the quicksand with this awful, substandard tooling. And we are not users, we are not people for whom products are meant, right? We make products for other people. Then I got to Google. [2:03] And Google, when I joined, was actually struggling with a very massive, very slow makefile that took forever to parse, let alone run. But the difference, which I had not seen anywhere else, was that Google paid a lot of attention to this problem and devoted a lot of resources to solving it. And Google was the first place I'd worked, and I think in many ways is still the gold standard, where developers are first-class participants in the business and deserve the best products and the best tools, and if there's nothing out there for them to use, we will build it in house and we will put a lot of energy into that. And so for me, specifically as an engineer, [2:53] a big part of watching that growth from the early to late 2000s was the growth of engineering process and best practices and the tools to enforce them. The thing I personally am passionate about is building CI, but I'm also talking about code review tools and all the tooling around source code management and revision control, and just everything to do with engineering process. It really was an object lesson, very, very fascinating, and it really inspired a big chunk of the rest of my career. I've heard all sorts of things, like Python scripts that had to generate makefiles, and finally they moved from that to the first version of Blaze. So it's a fascinating history. [3:48] Maybe can you tell us one example of something that was paradigm-changing that you saw, something that created an order of magnitude difference in your experience there, and maybe your first aha moment on how good developer tools can be? [4:09] Sure. I think I had been used to using make, basically, up till that point. And Google, again, as you mentioned, was using make and really squeezing everything it was possible to squeeze out of that lemon, and then some. [4:25] But in the very early versions of what became Blaze, which was that big internal build system that inspired Bazel (the open source variant of it today), one thing that really struck me was the integration with the revision control system, which was, and I think still is, Perforce. I imagine many listeners are very familiar with Git. Perforce is very different. I can only partly remember all of the intricacies of it, because it's been so long since I've used it. But one interesting aspect of it was that you could do partial checkouts. It really was designed for giant code bases. There was this concept of partial checkouts where you could check out just the bits of the code that you needed. But of course, then the question is, how do you know what those bits are? But of course the build system knows, because the build system knows about dependencies. And so there was this integration, this back and forth between the [5:32] Perforce client and the build system, that was very creative and very effective. It allowed you to have locally on your machine only the code that you actually needed to work on the piece of the codebase you were working on: basically the files you cared about and all of their transitive dependencies. And that to me was a very creative solution to a problem that involved some lateral thinking about how seemingly completely unrelated parts of the toolchain could interact. And that made me realize, oh, there's a lot of creative thought at work here, and I love it. [6:17] Yeah, no, I think that makes sense. I interned there way back in 2016, and I was just fascinated by it. I remember by mistake I ran a grep across the code base and it just took forever. And that's when I realized, you know, none of this stuff is local. First of all, half the source code is not even checked out to my machine, and my poor grep command is trying to check that out. But also how seamlessly it would work most of the time behind the scenes. Did you have any experience or did you start working on developer tools then? Or is that just what inspired you towards thinking about developer tools? I did not work on the developer tools at Google. I worked on ads and search and sort of Google products, but I was a big user of the developer tools. One exception was that I made some contributions to the [7:21] protocol buffer compiler, which I think many people may be familiar with, and which is a very deep part of the toolchain that is very integrated into everything there. And so that gave me some experience of what it's like to hack on a tool that every engineer is using and that is a very deep part of their workflow. But it wasn't until after Google, when I went to Twitter, [7:56] that I noticed that in my time at Google the rest of the industry had not caught up, and suddenly I was sort of transported ten years into the past and was back to using very slow, very clunky, flaky tools that were not designed for the tasks we were trying to use them for. And so that made me realize, wait a minute, I spent eight years using these great tools. They don't exist outside of these giant companies. I mean, I sort of assumed that maybe, you know, Microsoft and Amazon and some other giants probably have similar internal tools, but there's nothing out there for everyone else. And so that's when I started hacking on that problem more directly, at Twitter, together with John, who is now my co-founder at Toolchain, who was actually ahead of me and ahead of the game at Twitter and had already begun working on some solutions, and I joined him in that. Could you maybe describe some of the problems you ran into? Were the builds just taking forever, or was there something else? [9:09] So there were... [9:13] A big part of the problem was that Twitter at the time, or at least the codebase I was interested in and that John was interested in, was using Scala. Scala is a fascinating, very rich language. [9:30] Its compiler is very slow. And we were in a situation where, you know, you'd make some small change to a file and then builds would take 10 minutes, 20 minutes, 40 minutes. The iteration time on your desktop was incredibly slow. And then CI times, where there was CI in place, were also incredibly slow because of this huge amount of repetitive or near-repetitive work. And this is because the build tools, etc., were pretty naive about understanding what work actually needs to be done given a set of changes. There's been a ton of work specifically on SBT since then. [10:22] It has incremental compilation and things like that, but nonetheless, that still doesn't really scale well to large corporate codebases, what people often refer to as monorepos. If you don't want to fragment your codebase, with all of the immense problems that that brings, you end up needing tooling that can handle that situation. Some of the biggest challenges are: how do I do less than recompiling the entire codebase every time? How can tooling help me be smart about the correct minimal amount of work to do, [11:05] to make compiling and testing as fast as they can be? [11:12] And I should mention that I dabbled in this problem at Twitter with John. It was when I went to Foursquare that I really got into it, because Foursquare similarly had this big Scala codebase with a very similar problem of incredibly slow builds. [11:29] The interim solution there was to just upgrade everybody's laptops with more RAM and try and brute-force the problem.
It was very obvious to everyone there (Foursquare had, and still has, lots of very, very smart engineers) that this was not a permanent solution, and we were casting around for... [11:54] you know, what can be smart about Scala builds. And I remembered this thing that I had hacked on at Twitter, and I reached out to Twitter and asked them to open source it so we could use it and collaborate on it; it wasn't obviously some secret sauce. And that is how the very first version of the Pants open source build system came to be. It was very much designed around Scala; it did eventually support other languages. And we hacked on it a lot at Foursquare to get it to... [12:32] to get the codebase into a state where we could build it sensibly. So the one big challenge is build speed, build performance. The other big one is managing dependencies, keeping your codebase sane as it scales: everything to do with how can I audit internal dependencies? It is very, very easy to accidentally create all sorts of dependency tangles and cycles, and to create a code base whose dependency structure is unintelligible, really hard to work with, and actually impacts performance negatively, right? If you have a big tangle of dependencies, you're more likely to invalidate a large chunk of your code base with a small change. And so tooling that allows you to reason about the dependencies in your code base and [13:24] make them more tractable was the other big problem that we were trying to solve. Mm-hmm. No, I think that makes sense. I'm guessing you already have a good understanding of other build systems like Bazel and Buck. Maybe could you walk us through the differences for Pants? What are the major design differences? And even maybe before that, how was Pants designed? Is it something similar to creating a dependency graph, where you need to explicitly include your dependencies? Is there something else that's going on? [14:07] Maybe just a primer. Yeah. Absolutely. So I should mention, and I was careful to mention, you mentioned Pants V1. The version of Pants that we use today and base our entire technology stack around is what we very unimaginatively call Pants V2, which we launched two years ago almost to the day. That is radically different from Pants V1, from Buck, from Bazel. It is quite a departure in ways that we can talk about later. One thing that I would say Pants V1 and Buck and Bazel have in common is that they were designed around the use cases of a single organization. Bazel is an [14:56] open source variant of, or inspired by, Blaze; its design was very much inspired by "here's how Google does engineering." And Buck similarly for Facebook, and Pants V1, frankly, very similarly for [15:11] Twitter; and because Foursquare also contributed a lot to it, we sort of nudged it in that direction quite a bit. But it's still very much: if you did engineering in this one company's specific image, then this might be a good tool for you. But you had to be very much in that lane. What these systems all look like, and the way they are different from much earlier systems, is that [15:46] they're designed to work in large, scalable code bases that have many moving parts and share a lot of code, and that build a lot of different deployables, say binaries or Docker images or AWS lambdas or cloud functions or whatever it is you're deploying: Python distributions, Java files, whatever it is you're building, typically you have many of them in this code base. Could be lots of microservices, could be just lots of different things that you're deploying. And they live in the same repo because you want that unity. You want to be able to share code easily; you don't want to introduce dependency hell problems in your own code. It's bad enough that we have dependency hell problems in third-party code. [16:34] And so these systems, if you squint at them from thirty thousand feet, are all very similar today in that they make the problem of managing and building and testing and packaging in a code base like that much more tractable, and the way they do this is by applying information about the dependencies in your code base. So the important ingredient there is that these systems understand the relatively fine-grained dependencies in your code base, and they can use that information to reason about the work that needs to happen. With a naive build system, you'd say, run all the tests in the repo or in this part of the repo, and a naive system would literally just do that, and first it would compile all the code. [17:23] But a scalable build system like these would say, well, you've asked me to run these tests, but some of them have already been cached and these others haven't. So I need to look at the ones I actually need to run. So let me see what needs to be done before I can run them. Oh, these source files need to be compiled, but some of those are already in cache and these other ones I need to compile. But I can apply concurrency because there are multiple cores on this machine, so I can know through dependency analysis which compile jobs can run concurrently and which cannot. And then when it actually comes time to run the tests, again, I can apply that sort of concurrency logic. [18:03] And so these systems, what they have in common is that they use dependency information to make your building, testing, and packaging more tractable in a large code base. They allow you to not have to do the thing that unfortunately many organizations find themselves doing, which is fragmenting the code base into lots of different bits and saying, well, every little team or sub-team works in its own code base and they consume each other's code through, um, sort of as third-party dependencies, in which case you are introducing a dependency versioning hell problem. Yeah. And I think that's also what I've seen that makes the migration to a tool like this hard. Because if you have an existing code base that doesn't lay out dependencies explicitly, [18:56] that migration becomes challenging. If you already have an import cycle, for example, [19:01] Bazel is not going to work with you. You need to clean that up, or you need to create one large target where the benefits of using a tool like Bazel just go away. And I think that's a key bit, which is so fascinating, because it's the same thing over several years. And I'm hoping, since it sounds like newer tools like Go, at least, force you to not have circular dependencies and force you to keep your code base clean, that it's easy to migrate to a scalable build system. [19:33] Yes, exactly. So it's funny, that is the exact observation that led us to Pants V2. As I said, Pants V1, like Blaze, like Buck, was very much inspired by and developed for the needs of a single company, and other companies were using it a little bit. But it also suffered from many of the problems you just mentioned. With Pants V2, for the first time (by this point I had left Foursquare and started Toolchain), the exact mission was: every company, every team of any size, should have this kind of tooling, should have this revolutionary ability to make the code base fast and tractable at any scale. And that made me realize we have to design for that: we have to design not for what a single company's code base looks like, but to support thousands of code bases of all sorts of different challenges and sizes and shapes and languages and frameworks. So we actually had to sit down and figure out what it means to make a tool, a system like this, adoptable over and over again, thousands of times. You mentioned, [20:48] correctly, that it is very hard to adopt one of those earlier tools, because you have to first make your codebase conform to whatever it is that tool expects, and then you have to write huge amounts of manual metadata to describe the structure and dependencies of your codebase in these so-called build files. If anyone ever sees this written down, it's usually BUILD in all capital letters, like it's yelling at you, and those files typically are huge and contain a huge amount of information [21:27] describing your code base to the tool. With Pants V2 we took a very different approach. First of all, we said this needs to handle code bases as they are. So if you have circular dependencies, it should handle them gracefully and automatically. And if you have multiple conflicting external dependencies in different parts of your code base (this is pretty common, right: you need this version of whatever, Hadoop or NumPy or whatever it is, in this part of the code base, and you have a different, conflicting version in this other part of the code base), it should be able to handle that. If you have all sorts of dependency tangles and criss-crossing and all sorts of things that are unpleasant, and better not to have, but you have them, the tool should handle that. It should help you remove them if you want to, but it should not let those get in the way of adopting it. It needs to handle real-world code bases. The second thing is it should not require you to write all this crazy amount of metadata. And so with Pants V2, we leaned in very hard on dependency inference, which means you don't write these crazy build files. You write very tiny ones that just sort of say, you know, here is some code in this language for the build tool to pay attention to. But you don't have to add dependencies to them and edit them every time you change dependencies. Instead, the system infers dependencies by static analysis, and it does this at runtime. Almost all your dependencies, 99% of the time, are obvious from import statements.
I think that makes sense to me. My next question is around why I would want to use Pants v2 for a smaller codebase. Usually, with a smaller codebase, I'm not running into a ton of problems around the build.

[26:55] I guess, do you notice inflection points that people run into? Like: okay, my current build setup is not enough. What's the smallest codebase that you've seen that you think could benefit? Or is it any codebase in the world, and I should start with a better build system rather than just a Python setup.py or whatever?

I think the dividing line is: will this codebase ever be used for more than one thing?

[27:24] So let's take the Python example. If literally all this codebase will ever do is build this one distribution, and a top-level setup.py is all I need (you sometimes see this with open source projects), and the codebase is going to remain relatively small, say it's only ever going to be a few thousand lines, and even if I run the tests from scratch every single time it takes under five minutes, then you're probably fine. But I think there are two things to look at. First, am I going to be building multiple things in this codebase in the future, or certainly if I'm doing that now. That is much more common with corporate codebases. You have to ask yourself: okay, my team is growing, more and more people are cooperating on this codebase. I want to be able to deploy multiple microservices. I want to be able to deploy multiple cloud functions. I want to be able to deploy multiple distributions or third-party artifacts. I want to be able to deploy

[28:41] multiple data science jobs, whatever it is that you're building. If you ever think you might have more than one, now is the time to think about how to structure the codebase and what tooling allows you to do this effectively. And then the other thing to look at is build times. If you're using compiled languages, then obviously compilation; in all cases, testing. If you can already see that tests are taking five minutes, ten minutes, fifteen minutes, twenty minutes, then surely you want some technology that allows you to speed that up through caching, through concurrency, through fine-grained invalidation (namely: don't even attempt work that isn't necessary for the result that was asked for). Then it's probably time to start thinking about tools like this, because the earlier you adopt one, the easier it is to adopt. If you wait until you've got a tangle of multiple setup.pys in the repo, and it's unclear how you manage them and how you keep their dependencies synchronized so there aren't version conflicts across these different projects... Specifically with Python, this is an interesting problem. With other languages there is more, because of the compilation step: in JVM languages or Go, you

[30:10] encounter the need for a build system, a build system of some kind, much, much earlier, and then you ask yourself what kind. With Python, you can get by for a while just running flake8 and pytest directly, with everything all together in a single virtualenv. But the Python tooling, as mighty as it is, is mostly not designed for larger codebases that deploy multiple things and have multiple different sets of
[30:52] internal and external dependencies. The tooling generally implicitly assumes a single top-level project: one setup.py, one pyproject.toml, however you're configuring things. So especially if you're using Python, let's say for Django or Flask apps, or for data science, and your codebase is growing and you've hired a bunch of data scientists and there's more and more code going in there: with Python, you need to start thinking about what tooling allows you to scale this codebase.

No, I mostly resonate with that. The first question that comes to my mind: let's talk specifically about the deployment problem. If you're deploying to multiple AWS Lambdas or cloud functions or whatever, the first thought that comes to my mind is that I can use separate Docker images, which let me easily produce a container image that I can ship independently. Would you say that's not enough? I totally get that for the build-time problem a Docker image is not going to solve anything. But how about the deployment step?

[32:02] So again, with deployments, I think there are two ways a tool like this can really speed things up. One is to only build the things that actually need to be redeployed. Because the tool understands dependencies and can do change analysis, it can figure that out. One of the things Pants v2 does is integrate with Git, so it natively understands how to compute Git diffs. You can say something like: show me all the, whatever, lambdas, let's say, that are affected by changes between these two branches.

[32:46] And it understands. It can say: well, these files changed, and I understand the transitive dependencies of those files, so I can see what actually needs to be deployed. And in many cases, many things will not need to be redeployed, because they haven't changed. The other thing is a lot of performance improvements and process improvements around building those images. For example, for Python specifically, we have an executable format called PEX, which stands for Python EXecutable: a single file that embeds all of the Python code that is needed for your deployable, and all of its transitive external requirements, bundled up into a single self-executing file. This allows you to do things like: if you have to deploy 50 of these, you can basically have a single Docker image,

[33:52] and then on top of that you add one layer for each of the fifty, where the only difference in that layer is the presence of the PEX file. Whereas without all this, typically what you would do is have fifty Docker images, in each of which you have to build a virtualenv, which means running

[34:15] pip as part of building each image, and that gets slow and repetitive, and you have to do it fifty times. So even if you are deploying fifty different Docker images, we have ways of speeding that up quite dramatically, because, again, of things like dependency analysis, the PEX format, and the ability to build incrementally.

Yeah, I remember that at Dropbox we came up with our own par format to basically bundle up a Python binary; I think par stood for Python archive, I'm not entirely sure. But it did something remarkably similar, to solve exactly this problem. It just takes so long otherwise, especially if you have a large Python codebase. I think that makes sense to me.
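For reference, a PEX deployable in Pants v2 is declared with a pex_binary target. A sketch, with invented paths and names:

    # src/python/service_a/BUILD
    pex_binary(
        name="service_a",
        entry_point="main.py",  # bundled with its transitive deps into one .pex file
    )

    # Hypothetical workflow: package only what changed between two Git states
    # (e.g. `pants --changed-since=origin/main package`), then add each
    # resulting .pex as a thin final layer on one shared base image.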
The other thing that one might ask is: with Python, you would guess you don't really have too long of a build time, because there's nothing to build. Maybe MyPy takes some time to do some static analysis, and, of course, your tests can take forever and you don't want to rerun them. But there isn't that much of a build time that you have to think about. Would you say that you agree with this, or are there issues that end up happening on real-world codebases?

[35:37] Well, that's a good question. The word "build" means different things to different people, and we have recently taken to saying "CI" more, because I think it is clearer what that means. But when I say build, or CI, I mean it in the extended sense: everything you do to go from human-written source code to a verified, tested, deployable artifact. And so it's true that for Python there's no compilation step, although arguably running MyPy is really important, and now that I'm in the habit of using MyPy, I will probably never not use it on Python code ever again. So there are

[36:28] build-ish steps for Python, such as type checking, and such as running code generators like Thrift or Protobuf. And obviously a big, big one is resolving third-party dependencies, such as running pip or Poetry or whatever it is you're using. So those are all build steps. But with Python, really the big, big thing is testing and packaging, and primarily testing. With Python, you have to be even more rigorous about unit testing than you do with other languages, because you don't have a compiler that is catching whole classes of bugs. And again, MyPy and type checking really help with that. So to me, build (in the large sense) includes running tests, includes packaging, and includes all the quality control that you run, typically in CI or on your desktop, in order to say: well, I've made some edits, and here's the proof that these edits are good and I can merge them or deploy them.

[37:35] I think that makes sense to me. And I certainly saw it: with the limited amount of type checking you can do with Python (MyPy is definitely improving on this), you just need to unit test a lot to get the same amount of confidence in your own code, and unit tests are not cheap. The biggest question that comes to my mind is: is Pants v2 focused on Python? Because I have a TypeScript codebase at my workplace, and I would love to replace the TypeScript compiler with something slightly smarter, which could tell me: you know what, you don't need to run every unit test on every change.

[38:16] Great question. When we launched Pants v2, which was two years ago, we focused on Python; that was the initial language we launched with, because you have to start somewhere. And in the roughly ten years between the very Scala-centric work we were doing on Pants v1 and the launch of Pants v2, something really major happened in the industry, which was that Python skyrocketed in popularity. Python went from being mostly the little scripting language around the edges of your quote-unquote real code (people using Python like fancy Bash) to people building massive, multi-billion-dollar businesses entirely on Python codebases. A few things drove this. The biggest one, I would say, was that Python became the language of choice for data science, and we have strong support for those use cases.
Another was that Django and Flask became very popular for writing web apps. And there were more and more intricate DevOps use cases; Python is very popular for DevOps, for various good reasons. So,

[39:28] Python became super popular, and that was the first thing we supported in Pants v2. But we've since added support for Go, Java, Scala, Kotlin, and Shell. What we definitely don't have yet is JavaScript and TypeScript. We are looking at that very closely right now, because that is the very obvious next thing we want to add. Actually, if any listeners have strong opinions about what that should look like, we would love to hear from them, or from you, on our Slack channels or in our GitHub discussions, where we are having some lively discussions about exactly this. Because the JavaScript

[40:09] and TypeScript ecosystem is already very rich with tools, and we want to provide only value-add, right? We don't want to say: oh, here's another paradigm you have to adopt, and you have to replace this with this; you've just finished replacing NPM with Yarn, and now you have to do this other thing. We don't want to be another flavor of the month. We only want to do the work that uses those tools and leverages the existing ecosystem, but adds value. This is what we do with Python, and this is one of the reasons why our Python support is very, very strong, much stronger than any other comparable tool out there:

[40:49] a lot of leaning in on the existing Python tool ecosystem, but orchestrating those tools in a way that brings rigor and speed to your builds.

And I have used the word "we" a lot, so I want to clarify who "we" is here. There is Toolchain, the company, where we're working on SaaS and commercial solutions around Pants, which we can talk about in a bit. But there is also a very robust open source community around Pants that is not tightly held by Toolchain, the company, in the way that some other companies' open source projects are. We have a lot of contributors and maintainers on Pants v2 who are not working at Toolchain, but are using Pants in their own companies and their own organizations. And so we have a very wide range of use cases and opinions that are brought to bear. This is very important because, as I mentioned earlier, we are not trying to design a system for one use case, for one company's or team's use case. We are working on a system we want

[42:05] adopted over and over and over again, at a wide variety of companies. And so it's very important for us to have the contributions and the input from a wide variety of teams and companies and people, and it's very fortunate that we now do.

On that note, the thing that comes to my mind is that another benefit of a scalable build system like Pants or Bazel or Buck is that you don't have to learn various different commands when you are spelunking through a codebase, whether it's a Go codebase or a Java codebase or a TypeScript codebase. You just have to run pants build X, Y, Z, and it can construct the appropriate artifacts for you. At least, that was my experience with Bazel. Is that something Pants v2 does? Does it act as this meta layer over various other build systems, or is it much more specific and knowledgeable about the languages itself?

[43:09] I think your intuition is correct.
The idea is we want you to be able to do something like pants test: give it a path to a directory, and it understands what that means. Oh, this directory contains Python code, therefore I should run pytest in this way. And oh, it also contains some JavaScript code, so I should run the JavaScript tests in this way. It basically provides a conceptual layer above all the individual tools that gives you uniformity across frameworks, across languages. One way to think about this is:

[43:52] the tools are all very imperative. You have to run each of them with a whole set of flags and inputs, and you have to know how to use each one separately. So it's like having just the blades of a Swiss Army knife with no actual Swiss Army knife. A tool like Pants says: okay, we will encapsulate all of that complexity into a much simpler command-line interface. So you can run, like I said, pants test, or pants lint, or pants fmt, and it understands. Oh, you asked me to format your code. I see that you have Black and isort configured as formatters, so I will run them. And I happen to know that, because formatting can change the source files, I have to run them sequentially. But when you ask for lint, nothing is changing the source files, so I know that I can run multiple linters concurrently. That sort of logic. And different tools have different ways of being configured, or of telling you what they want to do, but

[44:58] Pants v2 encapsulates all of that away from you. And so you get this uniform, simple command-line interface that abstracts away a lot of the specifics of these tools and lets you run simple commands. The reason this is important is that this extra layer of indirection is partly what allows Pants to apply things like caching

[45:25] and invalidation and concurrency. Because what you're saying is, the way to think about it is: not "I am telling Pants to run tests", but "I am telling Pants that I want the results of the tests", which is a subtle difference. Pants then has the ability to say: well, I don't actually need to run pytest on all these tests, because I have results from some of them already cached, so I will return those from cache. So that layer of indirection not only simplifies the UI, but provides the point where you can apply things like caching and concurrency.
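"I want the results of the tests" is essentially memoization, keyed on every input that could affect the outcome. A toy sketch of that idea (this is not Pants's internal design):

    import hashlib

    _cache: dict[str, str] = {}  # stand-in for a persistent result cache

    def fingerprint(paths: list[str], tool_version: str) -> str:
        """Hash the contents of every input that could change the result."""
        h = hashlib.sha256(tool_version.encode())
        for p in sorted(paths):
            h.update(p.encode())
            with open(p, "rb") as f:
                h.update(f.read())
        return h.hexdigest()

    def test_results(test_file: str, transitive_deps: list[str], run) -> str:
        """Return cached results when nothing relevant has changed."""
        key = fingerprint([test_file, *transitive_deps], tool_version="pytest-7")
        if key not in _cache:  # fine-grained invalidation: only work on change
            _cache[key] = run(test_file)
        return _cache[key]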
Yeah, I think every programmer wants to work with declarative tools. I think SQL is one of those things where you don't have to know how the database works; if SQL were somewhat easier, that dream would be fulfilled. But I think we're all getting there. I guess the next question that I have is: what benefit do I get by using the Toolchain SaaS product versus Pants v2? When I think about build systems, I think about local development, and I think about CI.

[46:29] Why would I want to use the SaaS product?

That's a great question. Pants does a huge amount of heavy lifting, but in the end it is restricted to the resources on the machine on which it's running. So when I talk about cache, I'm talking about the local cache on that machine. When I talk about concurrency, I'm talking about using the cores on your machine. So maybe your CI machine has four cores and your laptop has eight cores, and that's the amount of concurrency you get, which is not nothing at all, which is great.

[47:04] But, as I mentioned, I worked at Google for many years, and then at other companies where distributed systems were central; I come from a distributed systems background. And when the problem is a piece of work taking a long time because of single-machine resource constraints, the obvious answer is to distribute the work: use a distributed system. And so that's what Toolchain offers, essentially.

[47:30] You configure Pants to point to the Toolchain system, which is currently SaaS (and we will have some news soon about some on-prem solutions). And now the cache that I mentioned is not just "did this test run with these exact inputs before, on my machine, by me, while I was iterating", but "has anyone in my organization, or any CI run, run this test before with these exact inputs?" So imagine a very common situation, where you come in in the morning and you pull all the changes that have happened since you last pulled. Those changes presumably passed CI, right? And the CI populated the cache. So now when I run tests, I can get cache hits from the CI machine.

[48:29] And then with concurrency, again: let's say that, post-cache, there are still 200 tests that need to be run. I could run them eight at a time on my machine, or the CI machine could run them, say, four at a time on four cores, or I could run 50 or 100 at a time on a cluster of machines. That's where, again, as your codebase gets bigger and bigger, some massive, massive speedups come in. I should mention that the remote execution I just described is something we're about to launch; it is not available today. The remote caching is. The other aspects are things like observability. When you run builds on your laptop or in CI, they're ephemeral: the output gets lost in the scrollback, and it's just a wall of text that disappears.

[49:39] With Toolchain, all of that information is captured and stored in structured form, so you have the ability to see past builds and see build behavior over time, to search builds, and to drill down into individual builds and see: well, how often does this test fail? And when did this get slow? All this kind of information. And so you get this more enterprise-level observability into a very core piece of developer productivity, which is iteration time. The time it takes to run tests, build deployables, and pass all the quality-control checks, so that you can merge and deploy code, directly relates to time to release. It directly relates to some of the core metrics of developer productivity: how long is it going to take to get this thing out the door? And so having the ability to both speed that up dramatically, through distributing the work, and to have observability into what work is going on: that is what Toolchain provides, on top of the already, if I may say, pretty robust open source offering.

[51:01] So yeah, that's kind of it.

[51:07] Pants on its own gives you a lot of advantages, but it runs standalone. Plugging it into a larger distributed system really unleashes the full power of Pants, as a client to that system.
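To picture the shared cache described above, here is a toy read-through lookup: check the local cache, then the remote cache that CI and teammates populate, and only then do the work. This is a sketch of the general pattern, not Toolchain's actual API:

    from typing import Callable, Optional

    def cached_result(key: str,
                      local: dict[str, bytes],
                      remote_get: Callable[[str], Optional[bytes]],
                      compute: Callable[[], bytes]) -> bytes:
        """Read-through: local cache, then shared remote cache, then compute."""
        if key in local:
            return local[key]
        value = remote_get(key)  # e.g. an HTTP GET against the cache service
        if value is None:
            value = compute()    # miss everywhere: actually run the work
            # a real client would also upload the result for others to reuse
        local[key] = value
        return value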
[51:21] No, I think what I'm seeing is this interesting convergence. There are several companies trying to do this for Bazel, like BuildBuddy and EngFlow. So it really sounds like the build system of the future. Like, ten years from now,

[51:36] no one will really be developing on their local machines anymore. There's GitHub Codespaces on one side, where you're doing all your development remotely.

[51:46] I've always found it somewhat odd that development happens locally, and that whatever scripts you need to run to provision your CI machine to run the same set of tests are so different, sometimes, that you can never tell why something's passing locally and failing in CI, or vice versa. There really should just be this one execution layer that can say: you know what, I'm going to build at a certain commit, or run at a certain commit. And that's shared between the local user and the CI user, and your CI script is something as simple as pants build //..., and it builds the whole codebase for you. So yeah, I certainly feel like the industry is moving in that direction. I'm curious whether you think the same. Do you have an even stronger vision of how folks will be developing ten years from now? What do you think it's going to look like?

Oh no, I think you're absolutely right. I think, if anything, you're underselling it. I think this is how all development should be, and will be, in the future, for multiple reasons. One is performance.

[52:51] Two is the problem of different platforms. So today, a big thorny problem is: I'm developing on my MacBook, and when I run tests locally, when I run anything locally, it's running on my MacBook. But that's not our deploy platform, right? Typically your deploy platform is some flavor of Linux.

[53:17] With the distributed-systems approach, you can run the work in containers that exactly match your production environments. You don't even have to care about "will my tests pass on macOS", or need CI that runs on macOS just to make sure that developers can pass tests on macOS, as if that were somehow correlated with success in the production environment. You can cut away a whole suite of those problems. Today, frankly (I mentioned earlier that you can get cache hits on your desktop from CI populating the cache), that is hampered by differences in platform,
and it's hampered by other differences in local setup that we are working to mitigate. But imagine a world in which build logic is not actually running on your MacBook, or, if it is, it's running in a container that exactly matches the container you're targeting. It cuts away a whole suite of problems around platform differences, and allows you to focus on just the platform you're actually going to deploy to.

[54:42] And just the speed and the performance of being able to work and deploy, and the visibility it gives you into the productivity and the operational work of your development team: I really think this absolutely is the future. There is something very strange about how, in the last 15 years or so, so many business functions have had the distributed-systems treatment applied to them. There are these massive, valuable companies providing systems that support sales, systems that support marketing, systems that support HR, systems that support operations, systems that support product management, systems that support every business function. And there need to be more of these that support engineering as a business function.

[55:48] And so I absolutely think that the idea that I need a really powerful laptop, so that running my tests can take thirty minutes instead of forty, when in reality it should take three minutes: that's not the future, right? The future, as it has been for so many other systems, is the web. The laptop that I can take anywhere, particularly in these work-from-home, work-from-anywhere times, is just a portal into the system that is doing the actual work.

[56:27] Yeah. And there are all these improvements across the stack, right? When I see companies like Vercel, it's: what if you use Next.js, we provide the best developer platform for that, and we want to provide caching. Then there are the lower-level systems, the build systems, of course, like Pants and Bazel and all that. And at each layer, we're kind of trying to abstract the problem out. So to me, it still feels like there is a lot of innovation to be done. And I'm also going to be really curious to know whether there are going to be a few winners of this space, or whether it's going to be pretty broken up, with everyone using different tools. It's going to be fascinating either way.

Yeah, that's really hard to know. One thing you mentioned that I think is really important is that you said your CI should be as simple as just pants build colon-colon (in our syntax it would be something like pants test lint ::). I think that's really important. So,

[57:30] today, one of the big problems with CI, which is still a growing market, as more and more teams realize the value and importance of very aggressive automated quality control, is that configuring CI is really, really complicated.
Every CI provider has its own configuration language, and you have to reason about caching, and you have to manually construct cache keys, to the extent that caching is even possible or useful. There's just a lot of figuring out how to configure and set up CI. And even then, it's just doing the naive thing.

[58:18] So there are a couple of interesting companies, Dagger and Earthly, and interesting technologies around simplifying that. But again, they are providing a (I would say) better and more uniform config language that allows you, for example, to run build steps in containers. And that's not nothing at all.

[58:43] But you are still manually creating a lot of configuration to run these very coarse-grained, large-scale, long-running build steps. I think the future is something like: my entire CI config, post cloning the repo, is basically pants build colon-colon, because the system does the configuration for you.

[59:09] It figures out what that means in a very fast, very fine-grained way, and it does not require you to manually decide on workflows and steps and jobs and how they all fit together, where, if I want to speed this thing up, then I have to manually partition the work somehow and write extra config to implement that partitioning. That is the future, I think. Rather than having the CI layer (say, the CI provider's proprietary config, or Dagger) and underneath that the build tool, which would be Bazel or Pants v2 or whatever it is you're using (for many companies today it could still be Make, or Maven, or Gradle), I really think the future is the integration of those two layers. In the same way that, as I referenced much, much earlier in our conversation, one thing that stood out to me at Google was that they had the insight to integrate the version control layer and the build tool, to provide really effective functionality there, I think the build tool, being the thing that knows about your dependencies,

[1:00:29] can take over many of the jobs of the CI configuration layer in a really smart, really fast way. The future is one where essentially more and more of "how do I set up and configure and run CI" is delegated to the thing that knows about your dependencies, knows about caching, knows about concurrency, and is able to make smarter decisions than you can in a YAML config file.

[1:01:02] Yeah, I'm excited for the time when I, as a platform engineer, have to spend less than 5% of my time thinking about CI and CD, and can focus on other things, like improving our data models, rather than mucking with the YAML and Terraform configs.

Well, yeah. Today we're still a little bit in that state, because we are engineers, and because the tools that we use are themselves made out of software, there's a strong impulse to tinker, and there's a strong impulse to say: well, I want to solve this problem myself, or I want to hack on it, or I should be able to hack on it. And you should be able to hack on it, for sure. But we do deserve more tooling that requires less hacking, and more things and paradigms that have been tested and have survived a lot of tire-kicking.

[1:02:00] Will we always need to hack on them a little bit? Yes, absolutely, because of the nature of what we do. I think there are a lot of interesting things still to happen in this space.

Yeah, I think we should end on that happy note, as we go back to our day jobs mucking with YAML. Well, thanks so much for being a guest.
I think this was a great conversation, and I hope to have you on the show again sometime. Would love that. Thanks for having me; it was fascinating. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Puneet Gupta is the co-founder and CEO of Amberflo , a cloud metering and usage-based pricing platform. Apple Podcasts | Spotify | Google Podcasts In this episode, we discuss Puneet’s fascinating background as an early GM at AWS and his early experience at Oracle Cloud. We initially discuss why AWS shipped S3 as its first product, before any other services. After that, we go over the cultural differences between AWS and Oracle, and how usage-based pricing and sales tied into each organization’s culture and efficiency. Our episode covers all the different ways organizations align themselves better when pricing is directly tied to the usage metrics of customers. We discuss how SaaS subscription models are simply reworkings of traditional software licenses, how vendors can dispel fears around overages due to dynamic pricing models, and even why Netflix should be a usage-based-priced service :-) We don’t have show notes, but I thought it would be interesting to link the initial press release for S3’s launch , to reflect on how our industry has completely changed over the last few years. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Akshay Buddiga is the co-founder and CTO of Traba, a labor management platform. Apple Podcasts | Spotify | Google Podcasts Sorry for the long hiatus in episodes! Today’s episode covers a myriad of interesting topics - from being the star of one of the internet’s first viral videos , to experiencing the hyper-growth at the somewhat controversial Zenefits, scaling out the technology platform at Fanatics, starting a company, picking an accelerator, only permitting in-person work, facilitating career growth of gig workers, and more! Highlights [0:00] - The infamous Spelling Bee incident. [06:30] - Why pivot to Computer Science after an undergraduate focus in biomedical engineering? [09:30] - Going to Stanford for Management Science and getting an education in Computer Science. [13:00] - Zenefits during hyper-growth. Learning from Parker Conrad. [18:30] - Building an e-commerce platform with reasonably high scale (powering all NFL gear) as a first software engineering gig. Dealing with lots of constraints from the beginning - like multi-currency support - and delivering a complete solution over several years. The interesting seasonality - like Game 7 of the NBA finals - and the implications on the software engineers maintaining e-commerce systems. Watching all the super-bowls with coworkers. [26:00] - A large outage, obviously due to DNS routing. [31:00] - Why start a company? [37:30] - Why join OnDeck ? [41:00] - Contrary to the current trend, Traba only allows in-person work. Why is that? We go on to talk about the implications of remote work and other decisions in an early startup’s product velocity. [57:00] - On being competitive. [58:30] - Velocity is really about not working on the incorrect stuff. [68:00] - What’s next for Traba? What’s the vision? [72:30] - Building two-sided marketplaces, and the career path for gig workers. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
James Cowling is the co-founder of Convex , a state management platform for web developers. Apple Podcasts | Spotify | Google Podcasts We discuss the state of web development in the industry today, and the various approaches to making it easier, contrasting the Hasura and Convex approaches as a good way to illustrate some of the ideas. Hasura lets you skip writing the web app and run GraphQL queries directly against the database. Convex, on the other hand, helps you stop worrying about databases: no setup or scaling concerns. It’s interesting to see how various systems are evolving to help developers reduce the busywork around more and more layers of the stack, and just focus on delivering business value instead. Convex also excels at the developer-experience portion - they provide a deep integration with React, use hooks (just like Apollo GraphQL), and seem to have a fully typed (and therefore auto-completable) SDK. I expect more companies will move “up the stack” to provide deeper integrations with popular tools like React. Episode Reading List * The co-founders of this company led Dropbox’s Magic Pocket project. * Convex → Netlify * Convex vs. Firebase * Prisma This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Josh Twist is the co-founder and CEO of Zuplo , a programmable, developer-friendly API Gateway Management Platform. Apple Podcasts | Spotify | Google Podcasts We discuss a new category of developer-tools startup - API Gateway Management Platforms. We go over what an API gateway is, why companies use gateways, common pain points in gateway management, and building reliable systems that serve billions of requests at scale. But most importantly, we dive into the story of Josh’s UK Developer of the Year 2009 award. Recently, I’ve been working on the Vanta API and was surprised at how poor the performance and developer experience around Amazon’s API Gateway is. It has poor support for rate limiting, and very high edge latency. So I’m excited for a new crop of companies to provide good solutions in this space. Episode Reading List * Amazon’s API Gateway * Stripe’s API - The first ten years * Envoy The Award This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Ted Young is the Director of Developer Education at Lightstep and a co-founder of the OpenTelemetry project. Apple Podcasts | Spotify | Google Podcasts This episode dives deep into the history of OpenTelemetry, why we need a new telemetry standard, all the work that goes into building generic telemetry processing infrastructure, and the vision for unified logging, metrics and traces. Episode Reading List Instead of highlights, I’ve attached links to some of our discussion points. * HTTP Trace Context - new headers to support a standard way to preserve state across HTTP requests. * OpenTelemetry Data Collection * Zipkin * OpenCensus and OpenTracing - the precursor projects to OpenTelemetry This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Or Weis is the CEO and founder of Permit.io , a Permission as a Service platform. Previously, he founded Rookout , a cloud-debugging tool. Apple Podcasts | Spotify | Google Podcasts Many of us have struggled (or are struggling) with permission management in the various applications we’ve built. The complexity of these systems always tends to increase through business requirements - for example, some content should only be accessed by paid users, or by users in a certain geography. Certain architectures like filesystems have hierarchical permissions that allow efficient evaluation, and there’s technical complexity that’s often unique to the specific application. We talk about all the complexity around permission management, and techniques to solve it, in this episode. We also explore how Permit tries to solve this as a product and abstract this problem away for everyone. Highlights [0:00] - Why work on access control? [02:00] - Sources of complexity in permission management [08:00] - Which cloud system manages permissions well? [11:00] - Product-izing a solution to this problem [17:00] - What kind of companies approach you for solutions to this problem? [22:00] - Why are there research papers written about permission management? [38:00] - Permission management across the technology stack (inter-service communication) [42:00] - What are you excited about building next? This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
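As a tiny illustration of the hierarchical-permissions point above: with path-based permissions, a check only has to walk up a resource's ancestors, so evaluation stays cheap. A toy sketch (the grant model is invented for illustration):

    def allowed(grants: dict[str, set[str]], path: str, action: str) -> bool:
        """Walk from the resource up toward the root; the deepest grant wins."""
        parts = path.strip("/").split("/")
        for i in range(len(parts), -1, -1):
            prefix = "/" + "/".join(parts[:i])
            if prefix in grants:
                return action in grants[prefix]
        return False

    grants = {"/docs": {"read"}, "/docs/hr": {"read", "write"}}
    assert allowed(grants, "/docs/hr/payroll.txt", "write")    # granted on /docs/hr
    assert not allowed(grants, "/docs/eng/spec.txt", "write")  # /docs is read-only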
 
Jon Skeet is a Staff Developer Platform Engineer at Google, working on Google Cloud Platform client libraries for .NET. He's best known for contributions to Stack Overflow as well as his book, C# in Depth . Additionally he is the primary maintainer of the Noda Time date/time library for .NET. You may also be interested in Jon Skeet Facts . Apple Podcasts | Spotify | Google Podcasts We discuss the intricacies of timezones, how to attempt to store time correctly, how storing UTC is not a silver bullet , asynchronous help on the internet, the implications of new tools like GitHub Copilot, remote work, Jon’s upcoming book on software diagnostics, and more. Highlights [01:00] - What exactly is a Developer Platform Engineer? [05:00] - Why is date and time management so tricky? [13:00] - How should I store my timestamps? We discuss reservation systems, leap seconds, timezone changes, and more. [21:00] - StackOverflow, software development, and more. [27:00] - Software diagnostics [32:00] - The evolution of StackOverflow [34:00] - Remote work for software developers [41:00] - Github Copilot and the future of software development tools [44:00] - What’s your most controversial programming opinion? This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Lee Byron is the co-creator of GraphQL , a senior engineering manager at Robinhood , and the executive director of the GraphQL foundation . Apple Podcasts | Spotify | Google Podcasts We discuss the GraphQL origin story, early technical decisions at Facebook, the experience of deploying GraphQL today, and the future of the project. Highlights (some tidbits) [01:00] - The origin story of GraphQL. Initially, the Facebook application was an HTML web-view wrapper. It seemed like the right choice at the time, with the iPhone releasing without an app store, Steve Jobs calling it an “internet device”, and Android phones coming out soon after, with Chrome, a brand-new browser. But the application had horrendous performance, high crash rates, used up a lot of RAM on devices, and animations would lock the phone up. Zuckerberg called the bet Facebook’s biggest mistake . The idea was to rebuild the app from scratch using native technologies. A team built up a prototype for the news feed, but they quickly realized that there weren’t any clean APIs to retrieve data in a palatable format for phones - the relevant APIs all returned HTML. But Facebook had a nice ORM-like library in PHP to access data quickly, and there was a parallel effort to speed up the application by using this library. There was another project to declaratively specify data requirements for this ORM, for increased performance and a better developer experience. Another factor was that mobile data networks were pretty slow, and having a chatty REST API for the newsfeed would lead to extremely slow round-trip times and tens of seconds to load the newsfeed. So GraphQL started off as a little library that could make declarative calls to the PHP ORM library from external sources and was originally called SuperGraph. Finally, the last piece was to make this language strongly typed, from the lessons of other RPC frameworks like gRPC and Thrift. [16:00] So there weren’t any data-loaders or any such pieces at the time. GraphQL has generally been agnostic to how the data actually gets loaded, and there are plugins to manage things like quick data loading, authorization, etc. Also, Facebook didn’t need data-loading, since its internal ORM managed de-duplication, so it didn’t need to be built until there was sufficient external feedback. [28:00] - GraphQL for public APIs - what to keep in mind. Query costing, and other differences from REST. [42:00] - GraphQL as an open-source project [58:00] - The evolution of the language, new features that Lee is most excited about, like Client-side nullability . Client-side nullability is an interesting proposal - where clients can explicitly state how important retrieving a certain field is, and on the flip side, allow partial failures for fields that aren’t critical. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Harshyt Goel is a founding engineer and engineering manager of Platform and Integrations at Loom , a video-messaging tool for workplaces. He’s also an angel investor, so if you’re looking for startup advice, investments, hiring advice, or a software engineering job, please reach out to him on Twitter . Apple Podcasts | Spotify | Google Podcasts We discuss Loom’s story, from when it had six people and a completely different product, to the unicorn it is today. We focus on driving growth, complicated product launches, and successfully launching the Loom SDK. Highlights [00:30] - How it all began [03:00] - Who is a founding engineer? Coming from Facebook to a 5 person startup [06:00] - Company inflection points. [10:30] - Pricing & packaging iterations. [14:30] - Running growth for a freemium product, and the evolution of growth efforts at Loom [30:00] - Summing up the opportunities unlocked by a growth team [33:00] - Sometimes, reducing user friction isn’t what you want. [34:30] - The Loom SDK, from idea to launch. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Daniel Stenberg is the founder and lead developer of curl and libcurl. Apple Podcasts | Spotify | Google Podcasts This episode, along with others like this one , reminds me of this XKCD. We dive into all the complexity of transferring data across the internet. Highlights [00:30] - The complexity behind HTTP. What goes on behind the scenes when I make a web request? [11:30] - The organizational work behind internet-wide RFCs, like HTTP/3. [20:00] - Rust in curl. The developer experience, and the overall experience of integrating Hyper. [30:00] - Web socket support in curl [34:00] - Fostering an open-source community. [38:00] - People around the world think Daniel has hacked their system, because of the curl license often included in malicious tools. [41:00] - Does curl have a next big thing? This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Sahil Lavingia is the founder of Gumroad , an e-commerce platform that helps you sell digital services. He also runs SHL Capital , a rolling fund for early-stage startups. Apple Podcasts | Spotify | Google Podcasts Sahil’s recent book, The Minimalist Entrepreneur , explores a framework for building profitable, sustainable companies. I’ve often explored the trade-off between software engineering and trying to build and launch my own company, so this conversation takes up that theme and explores what it means to be a minimalist entrepreneur as a software engineer. Highlights (edited) Utsav: Let’s talk about VCs (referencing your popular blog post “ Reflecting on My Failure to Build a Billion-Dollar Company ”). Are startups pushed to grow faster and faster due to VC dynamics, or is there something else going on behind the scenes? It’s a combination of things. People who get caught up in this anti-VC mentality are missing larger forces at play, because I don't really think it's just VCs who are making all of these things happen. Firstly, there’s definitely a status game being played. When I first moved to the Bay Area, as soon as you mentioned you were working on your own thing, the first question people asked was how far along your company is, who you raised money with, and how many employees you have, comparing you with other people they know. You can’t really get too upset at that, since that’s the nature of the people coming to a boomtown like San Francisco. The way I think about it, there’s a high failure rate in building a billion-dollar company, so you want to find out reasonably quickly whether you will succeed or not. Secondly, we’re in a very unique industry, where equity is basically the primary source of compensation. 90% of Americans don’t have some sort of equity component in the businesses they work for, but giving equity has a ton of benefits. It’s great to have that alignment, and folks who take an early risk on your company should get rewarded. The downside of equity is that it creates this very strong desire and incentive to make your company as valuable as possible, as quickly as possible. In order to get your equity to be considered valuable to investors, you need to grow quickly, so that investors can use the models that project your growth rate to form your valuation. Many people took my blog post to say it’s the VCs’ fault, but that’s not true. The VCs let me do what I wanted; they don’t really have that much power. The issue was that in order for employees to see a large outcome, you need the company to have a large exit. As a founder, you’d do pretty well if the company sold for $50 million, but that’s not true for employees - they really need this thing to work, otherwise the best ones can just go work for the next Stripe. So you have this winner-take-all behavior for employees, and it’s ultimately why I ended up shrinking the company to just me for a while. Utsav: So do you give employees equity in the minimalist entrepreneurship framework? Firstly: avoid hiring anyone else for as long as possible, until you know you have some kind of product-market fit. Beyond that, I think it depends on your liquidity strategy. How are you as a founder about to make money from this business? The way you incentivize your employees should align with that. If you want to sell your company for a hundred million dollars, consider sharing that and giving equity. If you plan to create a cash-cow business, consider profit sharing.
Utsav: What, if any, is the difference between indie-hacking and minimalist entrepreneurship? They’re pretty similar. Indie hacker seems like a personality, perhaps similar to a digital nomad, where the lifestyle takes precedence. I went to MicroConf in Las Vegas, and the attendees’ goals were fairly consistent - to buy a nice house and spend more time with their family. In that case, your goal should be to build the most boring but profitable business possible, for a community you don’t particularly care about, because your goals have nothing to do with serving that community - which is totally fine. No value judgments from me. With indie-hacking, it seems more geared around independence. I tried living the digital nomad life - work solo, travel the world, no schedule - but I didn’t actually enjoy it. It wasn’t really satisfying. I like working on a project with many people, where things improve, I get to learn from others, they learn from me, and I like talking to my customers, who I can talk to frequently, and whose lives are getting better because of my work. I enjoy that. So I wanted a middle ground between the “live on a beach” mentality and the blitzscaling, build-the-next-Facebook mentality. I like to think that with things like crowdfunding, this will get more and more feasible. Even though my article went viral and the ideas often resonated, there’s this aspirational aspect to many humans - they want to build something amazing and big. It’s kind of the Steve Jobs “make a dent in the universe” idea, even though he might not have actually said that. To account for that, I think incorporating some of the indie-hacker principles into the startup path might actually be the most applicable and accessible solution for people. Utsav: One of the key ideas in the book that stands out to me as a software engineer is that you can keep trying projects on the side. And eventually, if you’re doing things right - if you’re talking to customers - you will hit something that people want to buy or use. You’re probably not going to get it right the first time, but I think that’s a really important idea in the book. Could you elaborate on that? There are two kinds of people. One kind builds a lot of stuff but doesn’t know who it’s for. Another to-do-list app, a meditation app, you name it. So they build it, but then can’t figure out who’ll use it. The other kind is stuck in analysis paralysis, and can’t really hone in on an idea that they want to commit to. The solution to both of these personas is to forget about business, immerse yourself in the communities you care about, and try to help them. Focus on contributing to these communities. These could be Slack/Discord communities. For me, it was Hacker News, Dribbble, and IndieHackers. There’s a subreddit for everything. Start being a part of these communities, first by listening, and eventually by contributing. I can guarantee that if you become a useful part of the community and you share ideas, people will come up to you and talk about problems that they’re facing. For example, they’re getting paid by YouTube to produce fitness videos, but have to wait for the end of the month, and they’d really like to get paid instantly. Once a community trusts you, and you solve a problem for a specific set of people, you can instantly validate good ideas and deliver value. And iterating over ideas with this community gives you a good chance of success. Listen to the audio for the full interview! This is a public episode.
If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Nikita Gupta is a Co-Founder & CTO at Symba , a platform that helps manage talent development programs like internships. Internships are one of the most effective hiring channels at a software company, but a lot of work goes into managing successful interns. With hiring getting harder across the industry due to increased competition and funding, I thought it would be interesting to dive into understanding how to manage successful internship programs. Highlights 0:30 - What is Symba? 1:30 - Starting with the hot-takes. So, are college degrees overrated now? 5:30 - Why do I need a software platform to manage internships? 8:50 - Why do companies generally need to manage 8 - 10 platforms for internships? What have you seen? 10:30 - As a software engineer or manager, how do I make my intern successful? 13:30 - Cadence of check-ins 16:30 - With remote interns, how do you build a successful community? 18:50 - How do I measure the success/efficacy of my internship program? 21:00 - How do I know that my intern mentors/hosts are doing a good job? 25:00 - What are some concrete steps that I can take to increase my intern pool’s diversity? What should I track? 27:30 - What are some trends in the intern hiring space? 32:00 - Government investments in internship programs 33:00 - What’s your advice to the first-time intern mentor/host? This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Guy Eisenkot is a Senior Director of Product Management at BridgeCrew by Prisma Cloud and was the co-founder of BridgeCrew, an infrastructure security platform. We deep dive into infrastructure security, Checkov , and BridgeCrew in this episode. I’ve personally been writing Terraform for the last few weeks, and it often feels like I’m flying blind from a reliability/security perspective. For example, it’s all too easy to create an unencrypted S3 bucket in Terraform which you’ll only find out about when it hits production (via security tools ). So I see the need for tools that lint my infrastructure as code more meaningfully, and we spend some time talking about that need. We also investigate “how did we get here”, unravel some infrastructure as code history and the story behind Checkov’s quick popularity. We talk about how ShiftLeft is often a painfully overused term, the security process in modern companies, and the future of security, in a world with ever-more infrastructure complexity. Highlights 00:00 - Why is infrastructure security important to me as a developer? 05:00 - The story of Checkov 09:00 - What need did Checkov fulfil when it was released? 10:30 - Why don’t tools like Terraform enforce good security by default? 15:30 - Why ShiftLeft is a tired, not wired concept. 20:00 - When should I make my first security hire? 24:00 - Productizing what a security hire would do. 27:00 - Amazon CodeGuru but for security fixes - Smart Fixes . 33:00 - Is it possible to write infrastructure as code checks in frameworks like Pulumi ? 37:00 - Not being an early adopter when it comes to infrastructure tools. 40:00 - The Log4J vulnerability, and the security world moving forward. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
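As a small illustration of the unencrypted-bucket example above, here is a hand-rolled check over parsed Terraform resources. This is a sketch of the idea, not Checkov's actual API; the resource shape loosely follows `terraform show -json` output:

    def find_unencrypted_buckets(resources: list[dict]) -> list[str]:
        """Flag aws_s3_bucket resources with no server-side encryption configured."""
        flagged = []
        for r in resources:
            if r.get("type") != "aws_s3_bucket":
                continue
            values = r.get("values", {})
            if not values.get("server_side_encryption_configuration"):
                flagged.append(r.get("address", "<unknown>"))
        return flagged

    resources = [
        {"type": "aws_s3_bucket", "address": "aws_s3_bucket.logs",
         "values": {"server_side_encryption_configuration": [{"rule": []}]}},
        {"type": "aws_s3_bucket", "address": "aws_s3_bucket.uploads", "values": {}},
    ]
    print(find_unencrypted_buckets(resources))  # ['aws_s3_bucket.uploads']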
 
Tanmai Gopal is the founder of Hasura , an API as a service platform. Hasura lets you skip writing API layers and exposes automatic GraphQL APIs that talk to your database, trigger external actions, and much more. We talk about the implementation of a “compiler as a service”, the implications of primarily having a Haskell production codebase, their experience with GraphQL, hiring product managers technical enough to build useful features, some new and upcoming Hasura features, and riff about the current state of front-end development and `npm` hell. Highlights 00:20 - What does the name Hasura mean? 02:00 - What does Hasura do? 04:00 - Why build this layer of the stack? 08:00 - How to deal with authentication if APIs are exposed directly via the database. 26:00 - Does Hasura make production applications faster? 33:00 - JSON Aggregation in modern databases 38:00 - Why Haskell? 44:00 - How do you write quality Haskell? How does hiring for Haskell positions work out in practice? 55:00 - Application servers do much more than just talk to databases. How does Hasura provide escape hatches so that non-database interactions (for eg: talking to Stripe) work out? This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Kailash Nadh is the CTO of Zerodha , India’s largest retail stockbroker. Zerodha powers a large volume of stock trades - ~15-20% of India’s daily volume, which is significantly more daily transactions than Robinhood . Apple Podcasts | Spotify | Google Podcasts The focus of this episode is the technology and mindset behind Zerodha - the key technology choices, challenges faced, and lessons learned while building the platform over several years. As described on the company’s tech blog , Zerodha has an unconventional approach to building software - open source centric, relatively few deadlines, an incessant focus on resolving technical debt, and extreme autonomy for the small but efficient technology team. We dig into these and learn about the inner workings of one of India’s premier fintech companies. Highlights [00:43]: Can you describe the Zerodha product? Could you also share any metrics that demonstrate the scale, like the number of transactions or number of users? Zerodha is an online stockbroker. You can download one of the apps and sign up to buy and sell shares in the stock market, and invest. We have over 7 million customers, on any given day we have over 2 million concurrent users, and this week we broke our record for the number of trades handled in a day - 14 million, which represented over 20% of all Indian stock-trading activity. [03:00] When a user opens the app at 9:15 in the morning to see trade activity and purchase a trade, what happens behind the scenes? Life of a Query, Zerodha Edition [05:00] What exactly is the risk management system doing? Can you give an example of where it will block a trade? The most critical check is a margin check - whether you have enough purchasing-power margins in your account. With equities, it’s a simple linear check of whether you have enough, but with derivatives, figuring out whether you have sufficient margin is harder: if you already have some futures and options in your account, the risk is variable, based on that pre-existing position. What does the reconciliation process look like with the exchange? We have a joke in our engineering team that we’re just CSV engineers, since reconciliation in our industry happens via several CSV files that are distributed at the end of the trading day. [08:40] Are you still using PostgreSQL for storing data? We still use (abuse) PostgreSQL, with hundreds of billions of rows of data, sharded several ways. [09:40] In general, how has Zerodha evolved over time, from the v0 of the tech product to today? From 2010 to 2013, there was no tech team, and Zerodha’s prime value-add was a discount pricing model. We had vendor products that let users log in and trade, and the competition was on pricing. But they worked at 1/10,000th the scale that we operate at today, for a tiny fraction of the userbase. To give a sense of their maturity, they only worked on Internet Explorer 6. So in late 2014, we built a reporting platform that replaced this vendor-based system. We kept on replacing systems and dependencies, and the last piece left is the OMS - the Order Management System. We’ve had a project to replace the OMS ongoing for 2.5 years and are currently running an internal beta; once this is complete, we will have no external dependencies. The first version of Kite, written in Python, came out in 2015. Then, we rewrote some of the services in Go.
What does the reconciliation process look like with the exchange?

We have a joke on our engineering team that we're just CSV engineers, since reconciliation in our industry happens via several CSV files that are distributed at the end of the trading day.

[08:40] Are you still using PostgreSQL for storing data?

We still use (abuse) PostgreSQL with hundreds of billions of rows of data, sharded several ways.

[09:40] In general, how has Zerodha evolved over time, from the v0 of the tech product to today?

From 2010 to 2013, there was no tech team, and Zerodha's prime value add was a discount pricing model. We had vendor products that let users log in and trade, and the competition was on pricing. But they worked at 1/10,000th the scale we operate at today, for a tiny fraction of the user base. To give a sense of their maturity, they only worked on Internet Explorer 6. So in late 2014, we built a reporting platform that replaced this vendor-based system. We kept on replacing systems and dependencies, and the last piece left is the OMS - the Order Management System. We've had a project to replace this OMS ongoing for 2.5 years and are currently running an internal beta; once this is complete, we will have no external dependencies.

The first version of Kite, written in Python, came out in 2015. Then, we rewrote some of the services in Go. We now have a ton of services that do all sorts of things like document verification, KYC, payments, banking, integrations, trading, PNL, number crunching and analytics, visualizations, mutual funds - absolutely everything you can imagine.

[13:55] Why is it so tricky to rebuild an Order Management System?

There's no spec out there for building an Order or a Risk Management System. A margin check is based on mathematical models that take a lot of different parameters into account. We're doing complex checks based on models that we've reverse-engineered after years of experience with the system, as well as developing deep domain knowledge in the area. And once we build out the new system, we cannot simply migrate away from the old one, due to the high consequences of potential errors. So we need to test and migrate piecemeal.

[13:55] One thing you notice when using Zerodha is how fast it feels compared to standard web applications. This needs focus on both backend and frontend systems. To start with, how do you optimize your backends for speed?

When an application is slow (data takes more than a second to load), it's perceptible and can be annoying for users. So we're very particular about making everything as fast as possible, and we've set high benchmarks for ourselves. We set an upper limit on mean latency for users of no more than 40 milliseconds, which seems to work well for us, given all the randomness from the internet. All the code we write has to meet this benchmark. To make this work, there's no black magic, just common-sense principles. For the core flow of the product, everything is retrieved from in-memory databases, and nothing touches disk in the hot path of a request.

Serialization is expensive. If you have a bunch of orders and you need to send those back, serializing and deserializing takes time. So when events take place, like a new order being placed, we serialize once and store the result in an in-memory database. Then, when an HTTP request comes in from a user, instead of a database lookup and various transforms, the application reads directly from the in-memory database and writes the result to the browser. (A minimal sketch of this pattern appears below.)

Then, we have a few heuristics. For fetching really old reports that <2% of users use, it's okay for those to be slow. Those happen in separate paths so that they don't block the more frequent kinds of requests.

Finally, we've written all these services in Go, which is fast out of the box, provides a reasonably good developer experience, and has good concurrency primitives. We're careful with memory allocations and pool resources wherever applicable.
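Here is that sketch of the serialize-once pattern, with a plain dict standing in for the in-memory database (a real deployment might use something like Redis); the names and structure are illustrative assumptions, not Zerodha's actual code:

```python
# "Serialize once, serve many": do the expensive JSON encoding when the order
# event happens, so the hot HTTP path is a single lookup of pre-encoded bytes.
import json

cache: dict[str, bytes] = {}  # user_id -> pre-serialized order book

def on_order_event(user_id: str, orders: list[dict]) -> None:
    # Serialize exactly once, at event time...
    cache[user_id] = json.dumps(orders).encode()

def handle_orders_request(user_id: str) -> bytes:
    # ...so each request is a lookup with no database query or re-serialization.
    return cache.get(user_id, b"[]")

on_order_event("u1", [{"symbol": "INFY", "qty": 10, "status": "OPEN"}])
print(handle_orders_request("u1"))
```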
[24:00] Zerodha also seems to have skipped the React world, going with Flutter on mobile and Vue on the web. Can you speak to that decision for the mobile apps?

We initially built the iOS app in React Native about 3-4 years ago. I'm not sure how things are today, but the application was fairly slow. The bread and butter of a trading application is a bunch of ticking numbers showing stock values, and we experienced a bunch of frame drops while trying to render those. You'd think it'd be trivial to show that with low latency in 2017, but we were getting 5-10 frames/s, and Indian smartphones weren't extremely powerful then. We also ran into several library/dependency breakages. We then randomly ran across Flutter, which was in pre-alpha, and not a lot had been written in it.

Picking a bleeding-edge technology is very risky, and we wanted to evaluate the risk carefully, so we built out a full-blown prototype of the Kite app that had all the parts we thought would be bottlenecks, like web socket connections, updating numbers, list views, navigation, and transitions. We learned Dart (which you need in order to use Flutter). Once we built out the prototype, it was clear that the performance and experience with Flutter were significantly better than with React Native. So we made that very early decision to ditch React Native and adopt Flutter. We figured that even if Flutter got killed, we'd benefit from using it for a few years and would eventually move to something else, and traded off the risk. We launched the iOS application, fixed up issues, killed our Android application, and rolled out a Flutter version.

[28:30] How about not going with React on the website, and going with Vue?

We built our v1 with Angular, but we found it tricky to use for our small team, and there was the major version break fiasco. It was overly complicated even after months of use. We decided to skip out on sunk costs and evaluate something new. Picking Vue over React was primarily a judgment call; the template system reminded us of Django, and it felt easier to work with compared to wrapping HTML in JSX and function calls.

[30:30] How do you verify that your systems are correct and consistent? Could you walk us through the time when commodity prices went negative - what happened with your systems?

When commodity prices went negative, we lost a bunch of money, just like many other brokers and institutions. Thankfully, the Indian exchanges shut down trading after a while, and I think the exchanges themselves weren't equipped to handle negative commodity prices.

The nature of the stock market is that it's extremely complex and unquantifiable, and the complexity comes from human psychology and nature, which is hard to account for. A bit of news could come up that could shake up the market. There's price volatility, but in India, also regulatory volatility. Regulations come and change how brokers work overnight. They're all correct in spirit - they improve things for Indian investors - but they're massive changes nonetheless. Some changes completely alter how broking works in India, all with a month's notice. So change here is the only constant, and change management is complex, slow, and risky.

For the technical stuff, we do the standard unit tests and integration tests, but we do a ton of QA after. So after developers have validated changes, the application is handed to the various domain experts across the company to test and QA the changes. Their job is to try and break the system. This is very important, as there are a lot of behavior changes that are tricky to quantify. For example, a regulation might come in that requires stock splits to be handled in a certain way, and it's hard to back-test since it's never been implemented that way before. Due to the inherent complexity and rate of change, it's not feasible to implement a model of the stock market in a test. So the combination of automated testing and manual QA by domain experts is our first step, after which we release an internal beta, and then we slowly ramp up. Thankfully, this has worked for us.

[35:30] Release Cycles and Technical Decision Making at Zerodha

In terms of our release cycles, we move slowly and with care. If we feel that technical debt is mounting, we will pause feature development and address the debt.
We've rewritten core systems several times when the benefits were apparent. One of the really unique things about Zerodha is that technical decisions are entirely driven by technical folks. There are no business folks who come and say: don't fix that system, add this new feature instead. There's no pressure to ship features, we don't commit to shipping features every quarter, and there are no absurd goals. We agree that critical bugs should be fixed in a timely way, and we implement features only if they make sense. Sometimes a feature makes business sense, but the system is not ready for it, and there might be a hacky way to implement it. We never add hacks; instead, we clean up the debt, make the system amenable to the feature, and only then add it. This sounds slow, but it pays off in the long run. Because we've never let technical debt mount, and never compromised with hacks, we've been able to build things faster. After every refactor, the next set of features ends up being implemented extremely quickly. Even with continuous regulatory changes, we're always able to keep up and implement whatever's necessary. Ironically, we've shipped a lot of things fast and well by slowing down.

[38:00] What's different about Zerodha that it allows its tech team full autonomy?

It's common sense, really. The technology team does not have a full understanding of the business side of things, so it lets the business decide what to build. Likewise, the business does not really have the context on debt to demand technical changes. If you never pause to clear technical debt, it grows exponentially, and the system becomes a burden. I think if people had the empathy to understand that concerns like tech debt are legitimate, software companies would likely be much more productive than they are now. The other side of this is deadlines. Technical people find it extremely difficult to come up with deadlines due to the complexity of the space. Even for domain experts, the estimated time for a small task might be weeks, and a seemingly tricky task could take just hours. The core is to let technical people make these decisions.

[42:00] As a technical person, how do I know that there's too much technical debt? What frameworks can I use to understand that I should probably invest in the foundation, and how do I develop that intuition?

Intuition is the unquantifiable summation of past experiences. The more experienced you are, the more you can develop your intuition. But there are simple signals you can use to make these decisions if you're a competent developer. When you find it hard to collaborate, hard to ship new things, or there are consistent performance bottlenecks - these are simple, commonplace signs that something's wrong. If you realize that, had certain parts of your code been slightly more modular, you could have shipped the last 4-5 features faster - it's contextual, but you know it when you see it. Difficulties, annoyances, and bottlenecks like these indicate debt and burden.

[45:00] As a final question: if I'm a software engineer looking for a job where they value technical quality, how would you suggest I evaluate that from the outside?

First and most importantly, make sure to cut through the hype, and don't join a company just because others think it's a good idea to do so. Look past your biases and evaluate in a data-driven way. Most importantly, the kind of software produced by the company is the best indicator of the culture of the company.
Also, try to find resources that serve as indicators of culture and engineering practices. The reality is that most software engineering is really, really boring work, and innovation generally comes in spurts. Once you innovate, you need to build it into a usable system, which involves a lot of boilerplate. So most software engineering is boring, and that realization only comes with experience. Once you know that, you will be in a better position to make trade-offs, and you'll look for companies with a better culture, or whatever other parameters are important to you in your decision-making.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
 
Ganesh Datta is the CTO and co-founder of Cortex, a microservice management platform.

Apple Podcasts | Spotify | Google Podcasts

We continue the age-old monolith/microservice debate and dig into why companies seem to like services so much (I'm generally cautious about such migrations). Ganesh has a ton of insights into developer productivity and the tooling that makes engineering teams successful, which we dive into.

Highlights

00:00 - Why solve the service management problem?
06:00 - When to drive a monolith → service migration? What inflection points should one think about to make that decision?
08:30 - How would Ganesh approach his next service migration?
10:30 - What tools are useful when migrating to services?
12:00 - Standardizing infrastructure to facilitate migrations. How much should you standardize (à la Google) versus letting teams make their own decisions (à la Amazon)?
17:30 - How does a tool like Cortex help with these problems?
21:30 - How opinionated should such tools be? How much user education is part of building such tools?
27:00 - What are the key cultural components of successful engineering teams?
31:00 - Tactically, what does good service management look like today?
37:00 - What's the cost/benefit ratio of shipping an on-prem product vs. a SaaS tool?
41:30 - What would your advice be for the next software engineer embarking on their monolith → microservice migration?

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
 
Johannes Schindelin is the maintainer (BDFL) of Git for Windows.

Apple Podcasts | Spotify | Google Podcasts

Git is a fundamental piece of the software community, and we get to learn the history and inner workings of the project in this episode. Maintaining a widely-used open source project involves a ton of expected complexity around handling bug reports, deprecations, and inclusive culture, but it also requires management of interpersonal relationships, ease of contribution, and other aspects that are fascinating to learn about.

Highlights

00:06 - How did Johannes end up as the maintainer of Git for Windows?
06:30 - The Git community in the early days. Fun fact: Git used to be called `dircache`.
08:30 - How many downloads does Git for Windows get today?
10:15 - Why is Git for Windows a separate project? Why not make improvements to Git itself?
24:00 - How do you deprecate functionality when there are millions of users of your product and you have no telemetry?
30:00 - What does being the BDFL of a project mean? What does Johannes's day-to-day look like?
33:00 - What is GitGitGadget? How does it make contributions easier?
41:00 - How do you foster an inclusive community in an open-source project?
50:00 - What's next for Git?

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
 
Guido van Rossum is the creator of the Python programming language and a Distinguished Engineer at Microsoft.

Apple Podcasts | Spotify | Google Podcasts

We discuss Guido's new work on making CPython faster (PEP 659), tiers of Python interpreter execution, and high-impact, low-hanging-fruit performance improvements.

Highlights (an edited summary)

[00:21] What got you interested in working on Python performance?

Guido: In some sense, it was probably a topic that was fairly comfortable to me because it means working with the core of Python, where I still feel I know my way around. When I started at Microsoft, I briefly looked at Azure but realized I never enjoyed that kind of work at Google or Dropbox. Then I looked at machine learning, but it would take a lot of time to do something interesting with the non-Python, and even the Python-related, bits.

[02:31] What was different about the set of Mark Shannon's ideas on Python performance that convinced you to go after them?

Guido: I liked how he was thinking about the problem. Most of the other approaches to Python performance, like PyPy and Cinder, are not suitable for all use cases, since they aren't backward compatible with extension modules. Mark has the perspective and experience of a CPython developer, as well as a viable approach that maintains backward compatibility, which is the hardest problem to solve. The Python bytecode interpreter is often modified across minor releases (e.g., 3.8 → 3.9) for various reasons like new opcodes, so modifying that is a relatively safe approach.

Utsav: [09:45] Could you walk us through the idea of the tiers of execution of the Python interpreter?

Guido: When you execute a program, you don't know if it's going to crash after running a fraction of a millisecond, or whether it's going to be a three-week-long computation - it could be the same code, and in the first case it just has a bug. So, if it takes three weeks to run the program, maybe it would make sense to spend half an hour ahead of time optimizing all the code that's going to be run. But obviously, especially in dynamic languages like Python, where we do as much as we can without asking the user to tell us exactly how they need it done, you just want to start executing code as quickly as you can. That way, if it's a small script, or a large program that happens to fail early, or just exits early for a good reason, you don't spend any time being distracted by optimizing all that code.

So, what we try to do is keep the bytecode compiler simple, so that we get to execute the beginning of the code as soon as possible. If we see that certain functions are being executed many times over, then we call that a hot function, for some definition of "hot". For some purposes, maybe it's a hot function if it gets called more than once, or more than twice, or more than 10 times. For other purposes, you want to be more conservative, and you can say, "Well, it's only hot if it's been called 1000 times."

The specializing adaptive compiler (PEP 659) then tries to replace certain bytecodes with bytecodes that are faster, but that only work if the types of the arguments are specific types. A simple hypothetical example is the plus operator in Python. It can add lots of things like integers, strings, lists, or even tuples. On the other hand, you can't add an integer to a string.
So, the optimization step - often called quickening, but usually in our context we call it specializing - is to have a separate "binary add integer" bytecode, a second-tier bytecode hidden from the user. This opcode assumes that both of its arguments are actual Python integer objects, reaches directly into those objects to find the values, adds those values together in machine registers, and pushes the result back on the stack.

The binary add integer operation still has to make a type check on the arguments. So, it's not completely free, but a type check can be implemented much faster than a completely generic object-oriented dispatch, like what normally happens for most generic add operations.

Finally, it's always possible that a function is called millions of times with integer arguments, and then suddenly a piece of data calls it with a floating-point argument, or something worse. At that point, the interpreter will simply execute the original bytecode. That's an important part, so that you still have the full Python semantics.

Utsav: [18:20] Generally you hear of these techniques in the context of a JIT, a Just-In-Time compiler, but that's not being implemented right now.

Just-In-Time compilation has a whole bunch of emotional baggage with it at this point that we're trying to avoid. In our case, it's unclear what and when we're exactly compiling. At some point ahead of program execution, we compile your source code into bytecode. Then we translate the bytecode into specialized bytecode. Everything happens at some point during runtime, so which part would you call Just-In-Time?

Also, it's often assumed that Just-In-Time compilation automatically makes all your code better. Unfortunately, you often can't actually predict what the performance of your code is going to be, and we have enough of that with modern CPUs and their fantastic branch prediction. For example, we write code in a way that we think will clearly reduce the number of memory accesses. When we benchmark it, we find that it runs just as fast as the old unoptimized code, because the CPU figured out the access patterns without any of our help. I wish I knew what went on in modern CPUs when it comes to branch prediction and inline caching, because that is absolute magic.

Full Transcript

Utsav: [00:14] Thank you, Guido, for joining me on another episode of the Software at Scale podcast. It's great to have you here.

Guido: [00:20] Great to be here on the show.

Utsav: [00:21] Yeah. And it's just fun to talk to you again. So, the last time we spoke was at Dropbox many, many years ago. And then you retired, and then you decided that you wanted to do something new. And you work on performance now at Microsoft, and that's amazing. So, to start off with - you could pick any project that you wanted to, based on some slides that I've seen. So, what got you interested in working on Python performance?

Guido: [00:47] In some sense, it was probably a topic that was fairly comfortable to me because it means working with the core of Python, where I still feel I know my way around. Some other things I considered briefly in my first month at Microsoft - I looked into, "Well, what can I do with Azure?", and I almost immediately remembered that I was not cut out to be a cloud engineer. That was never the fun part of my job at Dropbox. It wasn't the fun part of my job before that at Google either. And it wouldn't be any fun to do that at Microsoft. So, I gave up on that quickly.
I looked into machine learning, which I knew absolutely nothing about when I joined Microsoft. I still know nothing, but I've at least sat through a brief course and talked to a bunch of people who know a lot about it. And my conclusion was actually that it's a huge field. It is mostly mathematics and statistics, and there is very little Python content in the field. And it would take me years to do anything interesting with the non-Python part, and probably even with the Python part, given that people just write very simple functions and classes, at best, in their machine learning code. But at least I know a bit more about the terminology that people use. And when people say kernel, I now know what they mean. Or at least I'm not confused anymore, as I was before.

Utsav: [02:31] That makes sense. And that is very similar to my experience with machine learning. Okay, so then you decided that you want to work on Python performance, right? And then you are probably familiar with Mark Shannon's ideas?

Guido: [02:43] Very much so. Yeah.

Utsav: [02:44] Yeah. So, was there anything different about that set of ideas, such that you decided this makes sense and I should work on a project to implement these ideas?

Guido: [02:55] Mark Shannon's ideas are not unique, perhaps, but I know he's been working on them for a long time. I remember many years ago, I went to one of the earlier Python UK conferences, where he gave a talk about his PhD work, which was also about making Python faster. And over the years, he's never stopped thinking about it. And he has a holistic attitude about it. Obviously, the results remain to be seen, but I liked what he was saying about how he was thinking about it.

If you take PyPy, it has always sounded like PyPy is a magical solution that only a few people in the world understand how it works. And those people built it and then decided to do other things. And then they left it to a team of engineers to solve the real problems with PyPy, which are all in the realm of compatibility with extension modules. And they never really solved that.

[04:09] So you may remember that there was some usage of PyPy at Dropbox, because there was one tiny process where someone had discovered that PyPy was actually so much faster that it was worth it. But it had to run in its own little process, and there was no maintenance. And it was a pain, of course, to make sure that there was a version of PyPy available on every machine. Because for the main Dropbox application, we could never switch to PyPy - that depended on 100 different extension modules, and just testing all that code would take forever.

[04:49] I think, since we're talking about Dropbox, Pyston was also an interesting example. They've come back, actually; you've probably heard that. The Pyston people were much more pragmatic, and they've learned from PyPy's failures.

[05:04] But they have always taken this attitude of, again, "we're going to start with CPython," which is good, because that way they are guaranteed compatibility with extension modules. But still, they made huge sets of changes, at least in Pyston one, and they had to roll back a whole bunch of things because, again, of compatibility issues. I think one of the things they had was a bunch of very interesting improvements to the garbage collection - I think they got rid of the reference counting. And because of that, the behavior of many real-world Python programs was completely changed.
[05:53] So why do I think that Mark's work, or Mark's ideas, will be different? Well, for one, because Mark has been a Python core developer for a long time. And so, he knows what we're up against. He knows how careful we are with backwards compatibility. And he knows that we cannot just get rid of reference counting or change the object layout. Like, there was a project that was recently released by Facebook that was basically born dead, or at least it was revealed to the world in its dead form - Cinder - which was a significantly faster Python implementation, but many of the optimizations came from changes in object layout that just aren't compatible with extension modules. And Mark has carved out these ideas that work on the bytecode interpreter itself.

[06:58] Now, the bytecode is something where we know that changing it is not going to affect third-party extension modules too much, because the bytecode changes in every Python release. And the internals of the bytecode interpreter change in every Python release. And yes, we still run into the occasional issue. Every release, there is some esoteric hack that someone is using that breaks. And they file an issue in the bug tracker because they don't want to research, or they haven't yet researched, what exactly is the root cause of the problem - because all they know is their users say, "My program worked in Python 3.7, and it broke in Python 3.8. So clearly, Python 3.8 broke something." And since it only breaks when they're using Library X, it must maybe be Library X's fault. But Library X's maintainers don't know exactly what's going on, because the user just says it doesn't work, or gives them a thousand-line traceback. And they bounce it back to core Python, and they say, "Python 3.8 broke our library for all our users, or 10% of our users," or whatever.

[08:16] And it takes a long time to find out: "Oh yeah, they're just poking inside one of the standard objects, using maybe information they gleaned from internal headers, or they're calling a C API that starts with an underscore." And you're not supposed to do that. Well, you can do that, but then you pay the price, which is that you have to fix your code at every next Python release. In between, for bug fix releases - if you go from 3.8.0 to 3.8.1, all the way up to 3.8.9 - we guarantee a lot more: the bytecode stays stable. But 3.9 may break all your hacks, and it changes the bytecode. One thing we did, I think in 3.10, was that all the jumps in the bytecode are now counted in instructions rather than bytes, and instructions are two bytes. Otherwise, the instruction format is the same, but all the jumps jump a different distance if you don't update your bytecode. And of course, the Python bytecode compiler knows about this. But people who generate their own bytecode, as the ultimate Python hack, would suffer.

Utsav: [09:30] So the biggest challenge by far is backwards compatibility.

Guido: [09:34] It always is. Yeah, everybody wants their Python to be faster, until they find out that making it faster also breaks some corner case in their code.

Utsav: [09:45] So maybe you can walk us through the idea of the tiers of execution, or tiers of the Python interpreter, that have been described in some of those slides.

Guido: [09:54] Yeah, so that is a fairly arbitrary set of goals that you can use for most interpreted languages, and it's actually a useful way to think about it.
And it's something that we plan to implement; it's not that there are actually currently tiers like that. At best, we have two tiers, and they don't map perfectly to what you saw in that document. But the basic idea is - I think this is also implemented in .NET Core, but again, I don't know if it's something documented, or if it's just how their optimizer works. So, when you just start executing a program, you don't know if it's going to crash after running a fraction of a millisecond, or whether it's going to be a three-week-long computation. Because it could be the same code, just in the first case, it has a bug. And so, if it takes three weeks to run the program, maybe it would make sense to spend half an hour ahead of time optimizing all the code that's going to be run. But obviously, especially in a dynamic language like Python, where we do as much as we can without asking the user to tell us exactly how they need it done, you just want to start executing the code as quickly as you can. So that if it's a small script, or a large program that happens to fail early, or just exits early for a good reason, you don't spend any time being distracted by optimizing all that code.

[11:38] And so if this were a statically compiled language, the user would have to specify that - basically, when they run the compiler, they say, "Well, optimize for speed or optimize for size, O2, O3, or maybe optimize for debugging, O0." In Python, we try not to bother the user with those decisions. So, you have to generate bytecode before you can execute even the first line of code. So, what we try to do there is keep the bytecode compiler simple, keep the bytecode interpreter simple, so that we get to execute the beginning of the code as soon as possible. If we see that certain functions are being executed many times over, then we call that a hot function, and you can define what's hot. For some purposes, maybe it's a hot function if it gets called more than once, or more than twice, or more than 10 times. For other purposes, you want to be more conservative, and you can say, "Well, it's only hot if it's been called 1000 times."

[12:48] But anyway, for a hot function, you want to do more work. And so, the specializing adaptive compiler, at that point, tries to replace certain bytecodes with bytecodes that are faster, but that work only if the types of the arguments are specific types. A simple but pretty hypothetical example is the plus operator, which in Python can add lots of things. It can add integers, it can add floats, it can add strings, it can add lists or tuples. On the other hand, you can't add an integer to a string, for example. So, what we do there in the optimization step - it's also called quickening, but usually in our context, we call it specializing - is we have a separate binary add integer bytecode. And it's a second-tier bytecode that is hidden from the user. If the user asks for the disassembly of their function, they will never see binary add integer; they will always see just binary add. But once the function has been quickened, the interpreter may see binary add integer. And binary add integer just assumes that both of its arguments - both the numbers on the stack - are actual Python integer objects, reaches directly into those objects to find the values, adds those values together in machine registers, and pushes the result back on the stack.
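As a toy model of that guard-and-specialize idea - plain Python, not CPython's actual machinery - the shape is roughly: count calls, swap in a specialized handler once the call site is hot, and fall back to the generic path whenever the guard fails:

```python
# Toy sketch of an adaptive call site. All names are invented for illustration.
def generic_add(a, b):
    return a + b  # full dynamic dispatch: works for ints, strings, lists...

def binary_add_int(a, b):
    # Guard: the fast path is only valid when both operands really are ints.
    if type(a) is int and type(b) is int:
        return a.__add__(b)   # stand-in for the fast inline integer path
    return generic_add(a, b)  # guard failed: fall back to the generic bytecode

handler = generic_add
calls = 0

def run_add(a, b):
    global handler, calls
    calls += 1
    # Hot and consistently called with ints: specialize this "call site".
    if handler is generic_add and calls > 10 and \
            type(a) is int and type(b) is int:
        handler = binary_add_int
    return handler(a, b)

for i in range(20):
    run_add(i, i)            # becomes specialized once it gets "hot"
print(run_add("a", "b"))     # guard fails -> generic path still works: "ab"
```

The point is the structure: a cheap type check in front of a fast path, with the fully generic operation always available as the fallback so Python semantics are preserved.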
[14:35] Now, there are all sorts of things that make that difficult to do. For example, if the value doesn't fit in a register - the result, or either of the input values - or maybe, even though you expected it was going to be adding two integers, this particular time it's adding an integer and a floating-point number, or maybe even two strings.

[15:00] So the first stage of specialization is actually - I'm blanking out on the term, but there is an intermediate step where we record the types of arguments. And during that intermediate step, the bytecode actually executes slightly slower than the default bytecode. But that only happens for a few executions of a function, because then it knows: this place is always called with integers on the stack, this place is always called with strings on the stack, and maybe this place, we still don't know, or it's a mixed bag. And so then, for the one where every time it was called during this recording phase it was two integers, we replace it with that binary add integer operation. The binary add integer operation, then, before it reaches into the object, still has to make a type check on the arguments. So, it's not completely free, but a type check can be implemented much faster than a completely generic object-oriented dispatch, like what normally happens for the most generic binary add operations.

[16:14] So once we've recorded the types, we specialize based on the types, and the interpreter then puts in guards. So, the interpreter code for the specialized instruction has guards that check whether all the conditions that make the specialized instruction work are actually met. If one of the conditions is not met, it's not going to fail, it's just going to execute the original bytecode. So, it's going to fall back to the slow path rather than failing. That's an important part, so that you still have the full Python semantics. And it's always possible that a function is called hundreds or millions of times with integer arguments, and then suddenly a piece of data calls it with a floating-point argument, or something worse. And the semantics still say, "Well, then it has to do it the floating-point way."

Utsav: [17:12] It has to deoptimize, in a sense.

Guido: [17:14] Yeah. And there are various counters in all the mechanisms where, if you encounter something that fails the guard once, that doesn't deoptimize the whole instruction. But if you keep encountering mismatches of the guards, then eventually the specialized instruction is just deoptimized, and we go back to, "Oh yeah, we'll just do it the slow way, because the slow way is apparently the fastest we can do."

Utsav: [17:45] It's kind of like branch prediction.

Guido: [17:47] I wish I knew what went on in modern CPUs when it comes to branch prediction and inline caching, because that is absolute magic. And it's actually one of the things we're up against with this project, because we write code in a way that we think will clearly reduce the number of memory accesses, for example. And when we benchmark it, we find that it runs just as fast as the old unoptimized code, because the CPU figured it out without any of our help.

Utsav: [18:20] Yeah. I mean, these techniques - generally you hear them in the context of a JIT, a Just-In-Time compiler, but y'all are not implementing that right now.

Guido: [18:30] JIT - yeah, in our case, it would be a misnomer. What we do expect to eventually be doing is, in addition to specialization, we may be generating machine code.
That's probably going to be well past 3.11, maybe past 3.12. So, the release that we ship in October next year is going to be 3.11, and that's where the specializing interpreter is going to make its first entry. I don't think that we're going to do anything with machine code unless we get extremely lucky with our results halfway through the year. But eventually, that will be another tier. But I don't know, Just-In-Time compilation has a whole bunch of emotional baggage with it at this point that we're trying to avoid.

Utsav: [19:25] Is it baggage from other projects trying it?

Guido: [19:29] People assume that Just-In-Time compilation automatically makes all your code better. It turns out that it's not that simple. In our case, compilation is like, "What exactly is it that we compile?" At some point ahead of time, we compile your source code into bytecode. Then we translate the bytecode into specialized bytecode. I mean, everything happens at some point during runtime, so which thing would you call Just-In-Time?

Guido: [20:04] So I'm not a big fan of using that term. And it usually makes people think of feats of magical optimization that have been touted by the Java community for a long time. And unfortunately, the magic is often such that you can't actually predict what the performance of your code is going to be. And we have enough of that, for example, with modern CPUs and their fantastic branch prediction.

Utsav: [20:35] Speaking of that, I saw that there are also a bunch of small wins y'all spoke about that can be used to just improve performance - things like fixing the place of __dict__ in objects and changing the way integers are represented. What is maybe one interesting story that came out of that?

Guido: [20:53] Well, I would say calling Python functions is something that we are actually currently working on. And I have to say that this is not just the Microsoft team, but also other people in the core dev team, who are very excited about this and helping us in many ways. So, the idea is that in the Python interpreter, up to and including version 3.10 - which is going to be released next week, actually - whenever you call a Python function, the first thing you do is create a frame object. And a frame object contains a bunch of state that is specific to the call that you're making. So, it points to the code object that represents the function being called, it points to the globals, it has space for the local variables of the call, it has space for the arguments, it has space for the anonymous values on the evaluation stack. But the key thing is that it's still a Python object. And there are some use cases where people actually inspect Python frame objects - for example, if they want to do weird stuff with local variables.

[22:18] Now, if you're a debugger, it makes total sense that you want to actually look at: what are all the local variables in this frame? What are their names? What are their values and types? A debugger may even want to modify a local variable while the code is stopped at a breakpoint. That's all great. But for the execution of most code, most of the time - certainly when you're not using a debugger - there's no reason for that frame to be a Python object. Because a Python object has a header, it has a reference count, it has a type, and it is allocated as its own small segment of memory on the heap. It's all fairly inefficient.
Also, if you call a function, then you create a few objects; then from that function you call another function; and all those frame objects end up scattered throughout the entire heap of the program.

[23:17] What we have implemented in our version of 3.11, which is currently just the main branch of the CPython repo, is an allocation scheme where, when we call a function, we still create something that holds the frame, but we allocate it in an array of frame structures. I can't call them frame objects, because they don't have an object header, they don't have a reference count or type; it's just an array of structures. This means that, unless that array runs out of space, calls can be slightly faster, because you don't jump around on the heap. Allocating the next frame is a matter of comparing two pointers and bumping one counter, and now you have a new frame structure. And so creation, and also deallocation, of frames is faster. Frames are smaller because you don't have the object header. You also don't have the malloc overhead or the garbage collection overhead.

And of course, it's backwards incompatible. So what do we do now? Fortunately, there aren't that many ways that people access frames. And what we do is, when people call an API that returns a frame object, we say, "Okay, well, sure. Here's the frame in our array. Now we're going to allocate an object, and we're going to copy some values to the frame object," and we give that to the Python code. So you can still introspect it, and you can look at the locals as if nothing has changed.

[25:04] But most of the time, people don't look at frames. And this is actually an old optimization - I remember that the same idea existed in IronPython. They did it differently; I think for them, it was a compile-time choice: when the bytecode equivalent in IronPython was generated for a function, it would make a choice whether to allocate a frame object or just a frame structure for that call. And their big bugaboo was: well, there is a function you can call, sys._getframe(), and it just gives you the frame object. So, in the compiler, they were looking at whether you used the exact name sys._getframe, and then they would say, "Oh, that's getframe; now we're going to compile you slightly slower, so you use a frame object." We have the advantage that we can just always allocate the frame object on the fly, but we get similar benefits.

And oh yeah, I mentioned that the frame structures are allocated in an array - what happens if that array runs out? Well, it's actually a linked list of arrays. So, we can still create a new array of frames - we have space for 100 or so, which in many programs is plenty. And if your call stack is more than 100 deep, we'll just have one discontinuity, but the semantics are still the same, and we still have most of the benefits.
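A toy sketch of that chunked, bump-allocated frame stack, using Python lists to stand in for the C arrays of structures (illustrative only - CPython does this with plain C structs and pointer arithmetic):

```python
# Frames live in chunked arrays and are handed out by bumping a counter,
# instead of each being a heap-allocated object with a header.
CHUNK = 4  # tiny for illustration; the real scheme uses much larger chunks

class FrameStack:
    def __init__(self):
        self.chunks = [[None] * CHUNK]  # "linked list of arrays", in spirit
        self.top = 0                    # next free slot in the last chunk

    def push(self, code_name):
        if self.top == CHUNK:                 # last chunk full: chain a new one
            self.chunks.append([None] * CHUNK)
            self.top = 0
        frame = {"code": code_name, "locals": {}}  # bare "struct", no header
        self.chunks[-1][self.top] = frame
        self.top += 1
        return frame

    def pop(self):
        # Assumes at least one frame has been pushed.
        if self.top == 0:                     # last chunk empty: drop it
            self.chunks.pop()
            self.top = CHUNK
        self.top -= 1
        frame, self.chunks[-1][self.top] = self.chunks[-1][self.top], None
        return frame

stack = FrameStack()
stack.push("main")
stack.push("helper")
print(stack.pop()["code"])  # helper
```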
Utsav: [26:39] Yeah, and maybe as a wrap-up question: there are a bunch of other improvements happening in the Python community for performance as well, right? There's Mypyc, which we're familiar with, which uses Mypy types to compile code and speed it up. Are there any other improvements like that, that you're excited about, or interested in following?

Guido: [27:01] Well, Mypyc is very interesting. It gives a much better performance boost, but only when you fully annotate your code, and only when you actually follow the annotations precisely at runtime. In Mypyc, if you say, "This function takes two integers, and it returns an integer," then if you call it with something else, it's going to immediately blow up. It'll give you a traceback. But the standard Python semantics are that type annotations are optional, and sometimes they're white lies. And so, the types that you see at runtime may not actually be compatible with the types that were specified in the annotations. And it doesn't affect how your program executes. Unless you start introspecting the annotations, your program runs exactly the same with or without annotations.

[28:05] I mean, there are a couple of big holes in the type system, like Any. And the type checker will say, "Oh, if you put Any, everything is going to be fine." And so, using that, it's very easy to have something that is passed an object of an invalid type, and the type checker will never complain about it. And our promise is that the runtime will not complain about it either, unless it really is a runtime error. Obviously, if you're somehow adding an integer to a string at runtime, it's still going to be a problem. But if you have a function that, say, computes the greatest common divisor of two numbers - which is this really cute little loop - if you define the percent operator in just the right way, you can pass in anything. I think there are examples where you can actually pass it two strings, and it will return a string without ever failing.

[29:07] And so basically, Mypyc does things like representing instance attributes in a compact way, where there is no dunder __dict__. The best that we can do - and we are working on designing how we're actually going to do it - is to make it so that, if you don't look at the dunder __dict__ attribute, we don't necessarily have to store the instance attributes in a dictionary, as long as we preserve the exact semantics. But if you use the dunder __dict__ at some later point, then, just like the frame objects, we have to materialize a dictionary. Mypyc doesn't do that. It's super-fast if you don't use dunder __dict__. If you do use dunder __dict__, it just says, "dunder __dict__ not supported in this case."

[29:59] Mypyc really only compiles a small subset of the Python language. And that's great if that's the subset you're interested in. But I'm sure you can imagine how complex that is in practice for a large program.

Utsav: [30:17] It reminds me of JavaScript performance, when everything is working fast and then you use this one function - which you're not supposed to use - to introspect an object or something, and then performance just breaks down.

Guido: [30:29] Yeah, that will happen.

Utsav: [30:31] But it's still super exciting. And I'm also super thankful that Python fails loudly when you try to add a number to a string - not like JavaScript.

Guido: [30:41] Or PHP, or Perl.

Utsav: [30:44] But yeah, thank you so much for being a guest. I think this was a lot of fun, and it walked through the performance improvements y'all are trying to make in an accessible way. So, I think it's going to be useful for a lot of people. Yeah, thank you for being a guest.

Guido: [30:58] My pleasure. It's been a fun chat.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
 
Abhay Venkatesh is a Software Engineer at Anduril Industries, where he focuses on infrastructure and platform engineering.

Apple Podcasts | Spotify | Google Podcasts

We focus this episode on drone engineering - exploring the theme of "If I wanted to start my own technology project/company that manages drones, what technology bits would I need to know?" We discuss the commoditization of drone hardware, the perception stack, testing and release cycles, simulation software, software invariants, defensive software architecture, and wrap up by discussing the business models behind hardware companies.

Highlights

1:56 - Are we getting robot cleaners (other than Roomba) anytime soon?
5:00 - What should I do if I want to build a technology project/company that leverages drones? Where should I be innovating?
7:30 - What does the perception stack for a drone look like?
13:30 - Are drones/robots still programmed in C++? How is Rust looked at in their world?
18:30 - What does software development look like for a company that deploys software on drones? What are the testing/release processes like?
20:30 - How are simulations used? Can game engines be used as simulations to test drones? Interestingly, since neural networks perceive objects and images very differently from how brains do, adapting drone perception to work on a game engine is actually really hard.
26:30 - Drone programming can be similar to client-side app development, but you have to write your own app store/auto-update infrastructure. Manually testing new releases is the largest bottleneck in releases.
30:00 - Defensive programming for drones: how do you ensure safety? What is the base safety layer that needs to be built for a drone? "Return to Base" logic is often separated out onto a different CPU (a hedged sketch of this idea follows below).
33:00 - How do hardware businesses look different from traditional SaaS businesses?
38:00 - What are some interesting trends in hardware that Abhay is excited about?

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
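Here is that sketch of a "Return to Base" safety layer: a watchdog that commands RTB when heartbeats from the main autonomy stack stop arriving. Every name, timeout, and design detail here is invented for illustration - this is the general pattern, not Anduril's implementation, and a real system would run it on a separate CPU:

```python
# Heartbeat watchdog: if the main flight software goes silent, fall back to
# a simple, independently-tested "return to base" behavior.
import time

HEARTBEAT_TIMEOUT_S = 2.0          # hypothetical safety budget
last_heartbeat = time.monotonic()

def on_heartbeat() -> None:
    """Called whenever the main autonomy stack checks in."""
    global last_heartbeat
    last_heartbeat = time.monotonic()

def watchdog_tick(command_return_to_base) -> None:
    """Run periodically; trigger RTB if the autonomy stack went silent."""
    if time.monotonic() - last_heartbeat > HEARTBEAT_TIMEOUT_S:
        command_return_to_base()

watchdog_tick(lambda: print("RTB: heading home"))  # no-op while heartbeats are fresh
```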
 
Derrick Stolee is a Principal Software Engineer at GitHub, where he focuses on the client experience of large Git repositories.

Apple Podcasts | Spotify | Google Podcasts

Subscribers might be aware that I've done some work on client-side Git in the past, so I was pretty excited for this episode. We discuss the Microsoft Windows and Office repositories' migrations to Git, recent performance improvements to Git for large monorepos, and more.

Highlights (lightly edited)

[06:00] Utsav: How and why did you transition from academia to software engineering?

Derrick Stolee: I was teaching and doing research at a high level and working with really great people. And I found myself not finding the time to do the work I was doing as a graduate student. I wasn't finding time to do the programming and do these really deep projects. I found that the only time I could find to do that was in the evenings and weekends, because that's when other people weren't working who could collaborate with me on their projects and move those projects forward. And then, I had a child, and suddenly my evenings and weekends weren't available for that anymore. And so the individual things I was doing just for myself, that were more programming oriented, fell by the wayside. I found myself a lot less happy with that career. And so I decided, you know what, there are two approaches I could take here. One is I could spend the next year or two winding down my collaborations and spinning up more of this time to be working on my own during regular work hours. Or I could find another job, and I was going to set out. And I lucked out that Microsoft has an office here in Raleigh, North Carolina, where we now live. This is where Azure DevOps was being built, and they needed someone to help solve some graph problems. So it was really nice that it happened to work out that way. I know for a fact that they took a chance on me because of their particular need. I didn't have significant professional experience in the industry.

[21:00] Utsav: What drove the decision to migrate Windows to Git?

The Windows repository moving to Git was a big project driven by Brian Harry, who was the CVP of Azure DevOps at the time. Previously, Windows used a source control system called Source Depot, which was a fork of Perforce. No one knew how to use this version control system until they got there and learned on the job, and that caused some friction in terms of onboarding people. But also, if you have people working in the Windows code base for a long time, they've only learned this version control system. They don't know Git, and they don't know what everyone else is using. And so they feel like they're falling behind, and they're not speaking the same language when they talk to somebody else who's working with commonly used version control tools. So they saw this as a way to not only update their source control to a more modern tool, but specifically to allow a freer exchange of ideas and understanding. The Windows Git repository is going to be big and have some little tweaks here and there, but at the end of the day, you're just running Git commands, and you can go look at Stack Overflow to solve questions, as opposed to needing to talk to specific people within the Windows organization about how to use this version control tool.

Transcript

Utsav Shah: Welcome to another episode of the Software at Scale podcast. Joining me today is Derrick Stolee, who is a Principal Software Engineer at GitHub.
Previously, he was a Principal Software Engineer at Microsoft, and he has a Ph.D. in Mathematics and Computer Science from the University of Nebraska. Welcome.

Derrick Stolee: Thanks, happy to be here.

Utsav Shah: So a lot of the work that you do on Git, from my understanding, is similar to the work you did in your Ph.D. around graph theory and stuff. So maybe you can just walk through what got you interested in graphs and math in general?

Derrick Stolee: My love of graph theory came from my first algorithms class in college, my sophomore year, just doing simple things like path-finding algorithms. And I got so excited about it, I started clicking around Wikipedia constantly; I just read every single article I could find on graph theory. So I learned about the four-color theorem, and I learned about different things like cliques, and all sorts of different graphs, the Petersen graph, and I just kept on discovering more. I thought, this is interesting to me, it works well with the way my brain works, and I could just model these things while [unclear 01:32]. And as I kept on doing more - for instance, graph theory and combinatorics my junior year for my math major - it was like, I want to pursue this. Instead of going into software, as I had planned with my undergraduate degree, I decided to pursue a Ph.D. in, first, math; then I moved over to the joint math and CS program, and just worked on very theoretical math problems. But I would always pair that with the fact that I had this programming and algorithmic background. So I was solving pure math problems using programming, creating these computational experiments - the thing I called it was computational combinatorics. I would write these algorithms to help me solve problems that were hard to reason about, because the cases just became too complicated to hold in your head. But if you could quickly write a program, then over the course of a day of computation you could discover lots of small examples that either answer the question for you or just give you a more intuitive understanding of the problem you're trying to solve. And that was my specialty as I was working in academia.

Utsav Shah: You hear a lot about proofs that are just computer-assisted today, and you could just walk us through - I'm guessing listeners are not math experts - so why is that becoming a thing? And just walk through your thesis in super layman terms: what did you do?

Derrick Stolee: There are two very different things you can mean when you say you have an automated proof. There are tools like Coq, which do completely automated formal logic proofs: you specify all the different axioms, the different things you know to be true, and the statement you want to prove, and it constructs the sequence of proof steps. What I was focused more on was taking a combinatorial problem - for instance, do graphs with certain sub-structures exist? - and trying to discover those examples using an algorithm that was finely tuned to solve those things. So one problem was called uniquely K_r-saturated graphs. A K_r is essentially a set of r vertices where every single pair is adjacent, and to be saturated means I don't have one inside my graph, but if I add any missing edge, I'll get one. And the "uniquely" part is: I'll get exactly one. And now we're at this fine line of: do these things even exist, and can I find some interesting examples?
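A brute-force checker for this property on small graphs makes the definition concrete (a sketch; vertex counts beyond a dozen or so need the smarter search described next):

```python
from itertools import combinations

def is_uniquely_kr_saturated(n, edges, r):
    """True if the graph on vertices 0..n-1 has no K_r, but adding any
    missing edge creates exactly one K_r."""
    edge_set = {frozenset(e) for e in edges}

    def count_kr(es):
        # Count vertex subsets of size r in which every pair is adjacent.
        return sum(
            1
            for verts in combinations(range(n), r)
            if all(frozenset(pair) in es for pair in combinations(verts, 2))
        )

    if count_kr(edge_set) != 0:  # must be K_r-free
        return False
    for u, v in combinations(range(n), 2):
        e = frozenset((u, v))
        if e not in edge_set and count_kr(edge_set | {e}) != 1:
            return False  # some missing edge completes zero or several K_r's
    return True

# The 5-cycle: triangle-free, and each added chord closes exactly one triangle.
c5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(is_uniquely_kr_saturated(5, c5, 3))  # True
```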
And so you can just [unclear 04:03] generate every graph of a certain size, but that blows up quickly. You end up being able to get to maybe 12 vertices - every graph on up to 12 vertices or so you can just enumerate and test. But to get beyond that, and find the interesting examples, you have to be zooming in on the search space to focus on the examples you're looking for. And so I wrote an algorithm that said: well, I know I'm not going to have every edge, so let's fix one pair and say this isn't an edge. And then we find r minus 2 other vertices and put all the other edges in, and that's the one unique completion of that missing edge. And then let's continue building in that way, by building up all the possible ways you can create those sub-structures, because they need to exist - as opposed to just generating random little bits. And that focused the search space enough that we could get to 20 or 21 vertices and see these interesting shapes show up. From those examples, we found some infinite families, and then used regular old-school math to prove that these families were infinite, once we had those small examples to start from.

Utsav Shah: That makes a lot of sense. And that tells me a little bit about how might someone use this in a computer science way? When would I need to use this in, let's say, not my day job, but - what computer science problems would I solve given something like that?

Derrick Stolee: It's the classic thing of asking a mathematician what the applications of the theoretical work are. But I find that whenever you see yourself dealing with a finite problem, and you want to know in what different ways this data can appear - is it possible with some constraints? - a lot of the things I was running into were similar problems to things like integer programming. Trying to find solutions to an integer program is a very general thing, and having those types of tools in your back pocket to solve these problems is extremely beneficial. It's also worth knowing that integer programming is still NP-hard. So if you have just the right data shape, it will take an exponential amount of time to work, even though there are a lot of tools that solve most cases, where your data shapes aren't particularly structured to cause that exponential blow-up. So knowing where those data shapes can arise, and how to take a different approach, can be beneficial.

Utsav Shah: And you've had a fairly diverse career after this. I'm curious, what was the transition from doing this stuff to Git, or developer tools? How did that end up happening?

Derrick Stolee: I was lucky enough that, after my Ph.D. was complete, I landed a tenure-track job in a math and computer science department, where I was teaching and doing research at a high level and working with great people. I had the best possible combinatorics working group I could ask for, doing interesting stuff, working with graduate students. And I found myself not finding the time to do the work I was doing as a graduate student; I wasn't finding time to do the programming and do these deep projects I wanted. I had a lot of interesting math projects, I was collaborating with a lot of people, I was doing a lot of teaching. But I was finding that the only time I could do that programming work was in evenings and weekends, because that's when other people weren't working who could collaborate with me on their projects and move those projects forward. And then I had a child, and suddenly my evenings and weekends weren't available for that anymore.
And so the individual things I was doing just for myself, the ones that were more programming-oriented, fell by the wayside, and I found myself a lot less happy with that career. So I decided there were two approaches I could take: one, I could spend the next year or two winding down my collaborations and freeing up more of my regular work hours for that kind of work, or I could find another job. And as I was about to set out, it so happened that my spouse, who is also an academic, had an opportunity to move to a new institution, soon after I made this decision. So I said, great, let's not do the two-body problem anymore: you take this job, we move right between semesters, during the Christmas break, and I will go try to find a programming job; hopefully someone will be interested. I lucked out that Microsoft has an office here in Raleigh, North Carolina, where we now live, and it happened to be the place where what is now known as Azure DevOps was being built. They needed someone to help solve some graph theory problems in the Git space. So it was nice that it happened to work out that way, and I know for a fact that they took a chance on me because of their particular need. I didn't have significant professional experience in the industry; I just said, I did academics, so I'm smart, and I did programming as part of my job, but it was always for myself. So I came in with a lot of humility, saying: I know I'm going to have to learn to work with a team in a professional setting; I did teamwork in undergrad, but it's been a while. I'll come in trying to learn as much as I can, as quickly as I can, and contribute in this very specific area you want me in. It turns out the area they needed was to revamp the way Azure Repos computed Git commit history, which is a graph theory problem. The thing that was interesting is that the previous solution did everything in SQL: when you created a new commit, it would say, what is your parent; let me take its commit history out of SQL, add this new commit, and put it back into SQL. It took what was essentially a SQL table of commit IDs and squashed it into a varbinary(max) column of a table, which ended up growing quadratically. Also, if you had a merge commit, it would have to take both parents and merge them in a way that never matched what git log was saying. So it was technically interesting that they were able to do this in SQL at all before I came by. But we needed to have the graph data structure available; we needed to dynamically compute by walking commits and finding out how these things relate, which led to creating a serialized commit-graph that had the topological relationships encoded concisely in a data file. That file would be read into memory, and very quickly we could operate on it, do things like topological sorting, and do interesting file-history operations on it instead of hitting the database. And by deleting those database entries that were growing quadratically, we saved something like 83 gigabytes, just on the one server that was hosting the Azure DevOps code. So it was great to see that come to fruition.
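A rough sketch of the core idea behind a serialized commit-graph (illustrative Python only, not Git's actual file format): each commit is stored once at a fixed position, and parents are referenced by small integer positions rather than by repeated full object IDs or re-copied histories.

```python
# Toy commit-graph: each commit stored once, parents as integer positions.
commits = []       # position -> (oid, tuple of parent positions)
position = {}      # oid -> position

def add_commit(oid, parent_oids):
    """Parents are added first, so each reference is one small integer,
    instead of re-copying an entire history per commit."""
    position[oid] = len(commits)
    commits.append((oid, tuple(position[p] for p in parent_oids)))

add_commit("a1", [])
add_commit("b2", ["a1"])
add_commit("c3", ["a1"])
add_commit("d4", ["b2", "c3"])        # a merge: two parent positions
print(commits[position["d4"]])        # ('d4', (1, 2))
```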
Utsav Shah: First of all, that's such an inspiring story, that you could get into this and that they gave you a chance as well. Did you reach out to a manager? Did you apply online? I'm just curious how that ended up working.

Derek Stolee: I do need to say I had a lot of luck and privilege going into this, because I applied and waited a month and didn't hear anything. I had applied to this same group and said, here's my cover letter; I heard nothing. But then I have a friend from undergrad who was one of the first people I knew to work at Microsoft. I knew he worked on the Visual Studio client editor, and I said: well, this thing that's now Azure DevOps was called Visual Studio Online at the time; do you know anybody from the Visual Studio Online group? I've applied there, haven't heard anything; I'd love it if you could get my resume to the top of the list. It turns out he had worked with somebody who had done the Git integration in Visual Studio, who happened to be located at this office, and who then got my name to the top of the pile. That got me to the point where I was having a conversation with who would be my skip-level manager, and he honestly had a conversation with me to try to suss out: am I going to be a good team player? There's not a great history of PhDs working well with engineers, probably because they just want to do their academic work in their own space. I remember one particular question: sometimes we ship software, and before we do that, we all get together and everyone spends an entire day trying to find bugs, and then we spend a couple of weeks trying to fix them; they call it a bug bash. Is that something you're interested in doing? I'm 100% wanting to be a good citizen, a good team member; I am up for that. If that's what it takes to be a good software engineer, I will do it. I could sense the hesitation and the trepidation about looking at me more closely, but overall, once I got into the interview, they were still doing whiteboard interviews at that time, and I felt it was almost unfair, because my phone-screen question was a problem I had assigned my C programming students as homework. So it's like, sure, you want to ask me this? I have a little bit of experience with problems like this. So I was eager to show up and prove myself. I know I made some very junior mistakes at the beginning, just learning what it's like to work on a team. What's it like to complete a pull request at 5 pm, then get in your car and go home, and realize while you're out that you had a problem and you've caused the build to go red? Oh no, don't do that. So I had those mistakes, but I only needed to learn them once.

Utsav Shah: That's amazing. And going to your second point around [inaudible 14:17] Git commit history and storing all of that in SQL: we had to deal with an extremely similar problem, because we maintain a custom CI server, and we tried doing Git [inaudible 14:26] and implementing that on our own, and that did not turn out well. So maybe you can walk listeners through why it is so tricky to say whether one commit is before another commit or after another commit, or what the parent of a commit is. What's going on, I guess?

Derek Stolee: Yes. The thing to keep in mind is that each commit has a list of parents: one parent, or multiple parents in the case of a merge, and that just tells you what happened immediately before this commit. But if you have to go back weeks or months, you're going to be traversing hundreds or thousands of commits, and these merge commits are branching.
And so not only are we going deep in time; think about how the first-parent history is all the pull requests that have merged in that time, but imagine you're also traversing all of the commits that were in the topic branches of those merges. So you go both deep and wide when you're doing this search. And by default, Git stores all of these commits as just plain-text objects in its object database: you look a commit up by its commit SHA, then you find its location in a pack file, you decompress it, you parse the text to find the different information (what's its author date, its committer date, what are its parents), and then you go find the parents and keep iterating through that. It's a very expensive operation on these orders of commits, and especially when the answer is "no, it's not reachable", you have to walk every single commit that is reachable before you can say no. Both of those things cause significant delays in answering these questions, which was part of the reason for the commit-graph file. It started when I was doing Azure DevOps server work, but it's now a Git client feature too. First, it avoids going to the pack file and loading a plain-text document you have to decompress and parse: instead I have well-structured information that tells me where in the commit-graph file the next commit is. I don't have to store the whole object ID; I just have a little four-byte integer saying my parent is this entry in the table, and you can jump quickly between them. The other benefit is that we can store extra data that is not native to the commit object itself, specifically what's called the generation number. The generation number says: if I don't have any parents, my generation number is one, so I'm at level one. If I have parents, I'm going to have a number one larger than the maximum among my parents: so if my one parent is one, I'm two, then three; and if I merge commits at four and five, I'm going to be six. What that allows is: if I see two commits, and one has generation number 10 and the other 11, then the one with generation number 10 can't reach the one with 11, because that would mean an edge goes in the wrong direction. It also means that if I'm looking for the one at 11, and I started at 20, I can stop when I hit commits at 10. This gives us extra ways of visiting fewer commits to solve these questions.
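A minimal sketch of that generation-number cutoff (illustrative Python over a toy in-memory graph; in real Git these numbers live in the commit-graph file):

```python
def generations(parents):
    """parents[i] is a tuple of parent indexes, each smaller than i.
    Roots get generation 1; otherwise 1 + max over the parents."""
    gen = []
    for ps in parents:
        gen.append(1 if not ps else 1 + max(gen[p] for p in ps))
    return gen

def can_reach(start, target, parents, gen):
    """Is `target` an ancestor of `start`? Skip any commit whose
    generation is below the target's: edges only point downward."""
    stack, seen = [start], {start}
    while stack:
        c = stack.pop()
        if c == target:
            return True
        for p in parents[c]:
            if p not in seen and gen[p] >= gen[target]:
                seen.add(p)
                stack.append(p)
    return False

parents = [(), (0,), (0,), (1, 2)]    # commit 3 merges commits 1 and 2
gen = generations(parents)            # [1, 2, 2, 3]
print(can_reach(3, 0, parents, gen))  # True
print(can_reach(1, 2, parents, gen))  # False, pruned without a full walk
```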
Utsav Shah: So, maybe a basic question: why does the system care about what the parents of a commit are? Why does that end up mattering so much?

Derek Stolee: Yes, it matters for a lot of reasons. One is, if you just want to go through the history of what changes have happened to my repository, specifically file history, the way to get them in order is not to say "give me all the commits that changed this file" and then sort them by date, because the commit date can be completely manufactured. Maybe something that was committed later merged earlier, or the other way around. By understanding those parent relationships, you can realize: this thing was committed earlier, but it landed in the default branch later, and I can see that by the way the commits are structured through these parent relationships. A lot of the problems we see, with people saying "where did my change go" or "what happened here", are because somebody did a weird merge. And you can only find that out by doing some interesting things with git log to say: this merge caused a problem and caused your file history to get mixed up; somebody resolved the merge incorrectly, causing a change to get erased, and you need to use these parent relationships to discover that.

Utsav Shah: Should everybody just be using rebase versus merge? What's your opinion?

Derek Stolee: My opinion is that you should use rebase to make sure that the commits you are trying to get reviewed by your coworkers are as clear as possible. Present a story; make your commits good; tell me in the messages why you're trying to do this one small change, and how the sequence of commits creates a beautiful story that tells me how I get from point A to point B. Then you merge it into your branch with everyone else's, and then those commits are locked: you can't change them anymore; you do not rebase them; you do not edit them; now they're locked in. The benefit of doing that is that I can present this best story, which is good not only for the people reviewing it in the moment, but also when I go back in history and ask, why did I change it that way? You've got all the reasoning right there. But then also you can do things like run git log --first-parent to just show which pull requests merged against this branch. That's it: I don't see people's individual commits, I see this one was merged, this one was merged, this one was merged, and I can see the sequence of those events, and that's the most valuable thing to see.
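To illustrate that first-parent view, a toy Python walk over an in-memory graph (`git log --first-parent` does this for real):

```python
def first_parent_log(head, parents):
    """Follow only each commit's first parent: on a branch built from
    merge commits, this yields one entry per merged pull request."""
    c = head
    while c is not None:
        yield c
        ps = parents[c]
        c = ps[0] if ps else None

parents = {
    "merge2": ("merge1", "topic-b"),
    "merge1": ("root", "topic-a"),
    "topic-a": ("root",),
    "topic-b": ("merge1",),
    "root": (),
}
print(list(first_parent_log("merge2", parents)))
# ['merge2', 'merge1', 'root'] -- the topic commits stay hidden
```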
Utsav Shah: Interesting. And then a lot of GitHub workflows just squash all of your commits into one, which I think is the default, or at least a lot of people use it. Any opinions on that? Because I know the Git workflow for development does the whole "separate by commits, then merge all of them".

Derek Stolee: Squash merges can be beneficial. The thing to keep in mind is that they're typically beneficial for people who don't know how to do an interactive rebase, so their topic branch looks like a lot of random commits that don't make a lot of sense: I tried this, then it had a break, so I fixed a bug, and I kept going forward; I'm responding to feedback; that's what it looks like. If those commits aren't going to be helpful to you in the future to diagnose what's going on, and you'd rather just say this pull request is the unit of change, the squash merge is fine; it's fine to do that. The thing I find problematic is that new users then don't realize they need to rebase their branch onto that squash merge before they continue working. Otherwise, they'll bring in those commits again, and their pull request will look very strange. So there are some unnatural bits to using squash merges that require people to just start over from the main branch again for their next piece of work, and if you don't remember to do that, it's confusing.

Utsav Shah: Yes, that makes a lot of sense. So going back to your story: you started working on improving Git interactions in Azure DevOps. When did the whole idea of "let's move the Windows repository to Git" begin, and how did that evolve?

Derek Stolee: Well, the biggest thing is that the decision to move the Windows repository to Git was made before I came. It was a big project driven by Brian Harry, who was the CVP of Azure DevOps at the time. Windows was using a source control system called Source Depot, which was a literal fork of Perforce. And no one knew how to use it until they got there and learned it on the job, and that caused some friction: onboarding people is difficult. But also, people who work in the Windows codebase for a long time learn this version control system and nothing else, so they feel like they're falling behind, and they're not speaking the same language when they talk to somebody working in the version control most people are using these days. So they saw this as a way not only to update their source control to a more modern tool, but specifically to Git, because it allowed a more free exchange of ideas and understanding. It's going to be a monorepo, it's going to be big, it's going to have some little tweaks here and there, but at the end of the day you're just running Git commands, and you can go look at Stack Overflow for how to solve your Git questions, as opposed to needing to talk to specific people within the Windows organization about how to use the tool. That, as far as I understand, was a big part of the motivation. When I joined the team, we were in the swing of "let's make sure our Git implementation scales". The thing that's special about Azure DevOps is that it doesn't use the core Git codebase; it has a complete reimplementation of the server side of Git in C#. It was rebuilding a lot of things just to be able to do the core features, but in a way that worked in its deployment environment, and it had done a pretty good job of handling scale. But the Linux repo was still a challenge to host. At that time, it had half a million commits, maybe 700,000, and its number of files is rather small; we were struggling especially with the commit history being so deep. Even the [inaudible 24:24] DevOps repo, with maybe 200 or 300 engineers working on it daily, was moving at a pace that was difficult to keep up with. So those scale targets were things we were dealing with daily and working to improve, and we could see that improvement in our own daily lives as we moved forward.

Utsav Shah: So how do you tackle the problem? You're on this team now, and you know that 2,000 developers are going to be using this repository; we have 200 or 300 people now, and it's already not perfect. My first impression is you sit down and start profiling code and understanding what's going wrong. What did you all do?

Derek Stolee: You're right about the profiler. We had a tool, I forget what it's called, but on every tenth request, selected at random, it would run a .NET profiler and save those traces to a place where we could download them. So we could say: you know what, Git commit history is slow, and now that we've written it in C# instead of SQL, it's the C#'s fault; let's go see what's going on there and see if we can identify the hotspots. You pull a few of those traces down and see what's identified, and a lot of it was chasing that: I made this change; let's make sure the timings improve; I see some outliers over here that are still problematic; find those traces and go identify the core parts to change.
Some of them were more philosophical: we need to change data structures; we need to introduce things like generation numbers; we need to introduce things like Bloom filters for file history, in order to speed things up, because we're spending too much time parsing commits and trees. And once we got that far, it was time to assess whether or not we could handle the Windows repo. That would have been January, February 2017. My team was tasked with doing scale testing in production. They had a full Azure DevOps server ready to go with the Windows source code in it; it didn't have developers using it, but it was a copy of the Windows source code, and they were using that same server for work-item tracking; they had already transitioned work-item tracking to Azure Boards. They said: go see if you can make this fall over in production; that's the only way to tell if it's going to work or not. So a few of us got together and created a bunch of things to drive the REST API. We were pretty confident the Git operations were going to work, because we had a caching layer in front of the server that would absorb that. So we went with the idea of: let's hit the REST API, make a few changes, create a pull request, merge it, and go through that cycle. We started by measuring how often developers would do that in Azure DevOps, then scaled it up and watched where it would break. And we crashed the job agents, because we found a bottleneck. It turns out we were using libgit2 to do merges, and that required going into native code, because it's a C library, and we couldn't have too many of those running, because they each took a gig of memory. Once that native code ran out of memory, things crashed, so we ended up putting a limit on how many could run. But that was the only fallout, and we could then say: we're ready to bring people on and start transitioning them over. And when users are in the product and find certain things rough or difficult, we can address them; right now, they're not going to cause a server problem, so let's bring it on. I think it was a few months later that they started bringing developers from Source Depot into Git.

Utsav Shah: So it sounds like there was some server work to make sure the server doesn't crash, but the majority of the work you had to focus on was on the Git side. Does that sound accurate?

Derek Stolee: Before my time, and in parallel with my time, was the creation of what's now called VFS for Git. It was GVFS at the time; we realized, don't let engineers name things, they won't do it well, so we renamed it to VFS for Git, a virtual file system for Git, a lot of [inaudible 28:44]. The Source Depot version Windows was using had a virtualized file system in it, to let people download only the portion of the working tree they needed. They could build whatever part they were in, and it would dynamically discover what files you need to run that build. So we did the same thing on the Git side: let's make the Git client, modified in some slight ways using our fork of Git, think that all the files are there.
Then, when a file is [inaudible 29:26], it goes through a file-system event, which communicates to a .NET process that says: you want that file; go download it from the Git server, put it on disk, and report its contents; now it can be placed. So it's dynamically downloading objects (a toy sketch of this on-demand hydration follows below). This required a new version of the protocol, which we call the GVFS protocol; it's essentially an early version of what's now called partial clone in Git. It says: you can go get the commits and trees, which is what you need to do most of your work; but when you need the file contents, the blob of a file, we can download it as necessary and populate it on your disk. The different thing is the virtualized piece: the idea that if you just run ls at the root directory, it looks like all the files are there. That causes some problems if you're not used to it. For instance, if you open VS Code at the root of your Windows source code, it will populate everything, because VS Code starts crawling to do searching and indexing and find out what's there. Windows developers were used to this; they had already had this problem, so they were used to tools that didn't do that. But we found it out when we started saying, VFS for Git is this thing Windows is using, maybe you could use it too: well, this was working great, then I opened VS Code, or I ran grep, or some other tool came in and decided to scan everything, and now I'm slow again, because I have absolutely every file in my monorepo in my working directory for real. So that led to some concerns that it wasn't necessarily the best way to go. But specifically with the GVFS protocol, it solved a lot of the scale issues, because we could stick another layer of servers closely located to the developers. For instance, take a lab of build machines and put one of these cache servers in there, so the build machines all fetch from it: there you have quick throughput and small latency, and they don't have to bug the origin server for anything but the refs. You do the same thing near the developers. That solved a lot of our scale problems, because you don't have these thundering herds of machines coming in and asking for all the data all at once.
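Here is that toy model of on-demand hydration (illustrative Python only; the real system uses a file-system driver plus the GVFS protocol, and these dictionaries merely stand in for the server and the projection layer):

```python
# Toy model: the working tree starts as placeholders (object IDs only);
# contents are "downloaded" from a remote store on first read, then cached.
remote_store = {"oid1": b"int main() {}", "oid2": b"README text"}  # the server
placeholders = {"src/main.c": "oid1", "README.md": "oid2"}         # projection
local_files = {}                                                   # hydrated

def read_file(path):
    if path not in local_files:                 # first access: hydrate
        local_files[path] = remote_store[placeholders[path]]
    return local_files[path]

print(read_file("src/main.c"))   # fetched on demand
print(read_file("src/main.c"))   # second read is purely local
```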
Utsav Shah: We had a super similar concept of repository mirrors that would be listening to a change stream; every time anything changed on a repo, they would fetch, and so would all the servers. So it's remarkable how similar the problems we're thinking about are. One thing I was wondering: VFS for Git makes sense, but what's the origin of the FS Monitor story? For listeners, FS Monitor is the file-system monitor in Git that decides whether files have changed without running [inaudible 32:08] that lists every single file. How did that come about?

Derek Stolee: There are two sides to the story. One is that as we were building all these features custom for VFS for Git, we were doing it inside the microsoft/git fork on GitHub, working in the open, so you can see all the changes we were making; it's all GPL. But we were making changes in ways that let us go fast, without contributing them upstream as core Git features. Because of the way VFS for Git works, we have this process that's always running, watching the file system and getting all of its events, so it made sense to say: well, we can speed up certain Git operations, because we don't need to go looking for things. We don't want to run a bunch of lstat calls, because that would trigger the download of objects. So we could ask that process: tell me what files have been updated, what's new; and that created the idea of what's now called FS Monitor. People who had built that tool for VFS for Git contributed a version of it upstream that used Facebook's Watchman tool through a hook. It created this hook called the fsmonitor hook: Git would say, tell me what's been updated since the last time I checked, and Watchman, or whatever tool was on the other side, would reply with the small list of files that had been modified. You don't have to go walking all of the hundreds of thousands of files, because you just changed these [inaudible 0:33:34], and the Git command could use that and be fast at things like git status and git add. So that was something contributed mostly out of the goodness of their hearts: we have this idea, it worked well in VFS for Git, we think it can work well for other people in regular Git, here we go, contributing it and getting it in.
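The contrast in a sketch (schematic Python; the real hook protocol exchanges a clock token with Watchman or Git's built-in daemon, and these function names are ours):

```python
import os

def status_full_scan(tracked_mtimes, root="."):
    """Without fsmonitor: stat every tracked file and compare timestamps."""
    return [
        path
        for path, recorded in tracked_mtimes.items()
        if os.stat(os.path.join(root, path)).st_mtime != recorded
    ]

def status_with_monitor(changed_since_token, tracked_mtimes):
    """With fsmonitor: the watcher already knows the handful of paths
    touched since the last token; only those need re-checking."""
    return [p for p in changed_since_token if p in tracked_mtimes]
```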
It became much more important to us in particular when we started supporting the Office monorepo, because they had a similar situation: they were moving from their version of Source Depot into Git, and they thought VFS for Git was just going to work. The issue is that Office also builds tools for iOS and macOS, so they have developers on macOS, and the team had started building a similar file-system virtualization for macOS using kernel extensions. It was very far along in the process when Apple said: we're deprecating kernel extensions, you can't do that anymore; if you're someone like Dropbox, go use this thing; otherwise, use this other thing. We tried both of those, and neither works in this scenario; they're either too slow or not consistent enough. For instance, in Dropbox, say you want to populate your files dynamically as people ask for them. The way Dropbox or OneDrive does that, the operating system may decide: I'm going to delete this content because the disk is getting too big; you don't need it, because you can just get it from the remote again. That inconsistency was something we couldn't handle, because we needed to know that content, once downloaded, was there. So we were at a crossroads, not knowing where to go. But then we decided to take an alternative approach: let's look at how the Office monorepo differs from the Windows monorepo. It turns out that Office had a very componentized build system, where if you wanted to build Word, you knew what you needed to build Word: you didn't need the Excel code, you didn't need the PowerPoint code; you needed the Word code and some common bits for all the Microsoft Office clients. And this was ingrained in their project system. So, if you know that in advance, could you just tell Git: these are the files I need to do my work and to do my build? That's what they were doing in their version of Source Depot; they weren't using a virtualized file system there; they were just enlisting in the projects they cared about. So when some of them were moving to Git with VFS for Git, they were confused: why do I see so many directories? I don't need them. So we decided to make a new approach, taking all the good bits from VFS for Git, like the GVFS protocol that allowed the reduced downloads, but instead of a virtualized file system, using sparse checkout. That's a good Git feature that lets you say: only give me the files within these directories, and ignore everything outside. That gives us the same benefit of working in a smaller working directory, without needing a virtualized file system. But now we needed that file-system monitor hook we added earlier, because if I still have 200,000 files on my disk and I edit a dozen, I don't want to walk all 200,000 to find that dozen. So the file-system monitor became top of mind for us, particularly because we want to support Windows developers, and Windows process creation is expensive, especially compared to Linux, where process creation is super fast. Having a hook run, which then does some shell-script work to communicate with another process and come back: just that, even when there's nothing to do, was expensive enough to say we should remove the hook from the equation. Also, there are some things Watchman does that we don't like and that aren't specific enough to Git, so let's make a version of the file-system monitor that is integrated into Git. That's what my colleague Jeff Hostetler is working on right now; it's getting reviewed in the core Git client, and it's available in Git for Windows if you want to try it, because the Git for Windows maintainer is also on my team, so we've got an early version in there. But we want to make sure this is available to all Git users: there's an implementation for Windows and macOS, and it's possible to build one for Linux; we just haven't included it in this first version. Our target is to remove that overhead. I know that you at Dropbox had a blog post where you got a huge speed-up just by replacing the Perl-script hook with a Rust hook, is that correct?

Utsav Shah: It was a Go hook first, yes, but eventually we replaced it with the Rust one.

Derek Stolee: Excellent. And you also made some contributions to help make this hook system a little bit better and have fewer bugs.

Utsav Shah: I think yes, one or two bugs, and it took me a few months of digging to figure out what exactly was going wrong. It turned out there's this one environment variable that forces the untracked cache (I believe it's GIT_FORCE_UNTRACKED_CACHE), which you or somebody else added. We just forced that environment variable to true to make sure we cache every time you run git status, so subsequent git statuses are not slow, and things worked out great. We ended up shipping a wrapper that turned on that environment variable, and things worked amazingly well. That was so long ago. How long does process creation take on Windows? That's one question I've had for you for a while: do you know what was slow about creating processes on Windows?
Derek Stolee: Well, I know there are a bunch of permission things that Windows does; it has many checks about whether you can create a process of this kind and exactly what elevation privileges you have. There are a lot of things there that have built up, because Windows is very much about maintaining backward compatibility with a lot of these security mechanisms. So I don't know all the details, but I do know it's on the order of 100 milliseconds. That's not something to scoff at, and it's a thing Git for Windows in particular has difficulty with, because it has to go through a bunch of translation layers to take this tool that was built for a Unix environment, with dependencies on things like shell and Python and Perl, and make sure it can work in that environment. That's an extra cost Git for Windows has to pay over even a normal Windows process.

Utsav Shah: Yes, that makes a lot of sense. And maybe some numbers, whatever you can share: how big were the Windows and the Office monorepos when you all decided to move from Source Depot to Git? What are we talking about here?

Derek Stolee: The biggest numbers we think about are: how many files do I have if I do nothing but check out the default branch and count the files? I believe the Windows repository was somewhere around 3 million, and the uncompressed data was something like 300 gigabytes across those 3 million files. I don't know the full size for the Office repo, but it is 2 million files at head. So, definitely a large project. They did their homework in terms of removing large binaries from the repository, so it's not big because of that; Git LFS isn't going to be the solution for them. They have mostly source code and small files that are not the reason for their growth. The reason for their growth is that they have so many files and so many developers moving that code around, adding commits and collaborating, that it's just going to get big no matter what you do. At one point, the Windows monorepo had 110 million Git objects, and I think over 12 million of those were commits, partly because they had some build machinery that would commit 40 times during its build. So they reined that in, and they decided to do a history cutting and start from scratch; now it's not growing nearly as quickly, but it's still a very similar size, so they've got more runway.

Utsav Shah: Maybe just for comparison for listeners: the numbers I remember from 2018, the biggest open-source repository where people were contributing to Git performance was Chromium, and I remember Chromium being roughly 300,000 files, with a couple of Chromium engineers contributing to Git performance. So this is another order of magnitude bigger, 3 million files. I don't think there are a lot of people moving such a large repository around, especially with that kind of history, 12 million commits; it's just a lot. What was the reaction of the open-source community and the Git maintainers when you decided to help out? Did you have a conversation to start with? Were they just super excited when you reached out on the mailing list? What happened?

Derek Stolee: For full context, I switched over to working on the client side and contributing to upstream Git after VFS for Git was announced and released as open-source software.
So I can only gauge what I saw from people afterward and from people I've come to know since then, but the general reaction was: yes, it's great that you can do this, but if you had contributed to Git, everyone would benefit. Part of it was that the initial plan wasn't ever to open-source it; the goal was to make this work for Windows, and if that was the only group that ever used it, that was a success. And it turned out we could use it in our pitch: because we can host the Windows source code, we can handle your source code. That was a marketing point for Azure Repos, and there was a big push to put this out in the world. But it also needed this custom thing that was only on Azure Repos, and we created it with our own opinions, which wouldn't be up to snuff with the Git project. So things like FS Monitor and partial clone are direct contributions from Microsoft engineers at the time, saying: here's a way to contribute the ideas that made VFS for Git work back into Git. That was an ongoing effort to bring it back, but it started after the fact: hey, we are going to contribute these ideas, but first we needed to ship something. So we shipped something without working with the community. But I think that over the last few years, especially with the way we've shifted our strategy to do sparse-checkout things with the Office monorepo, we've been much more able to align the things we want to build: we can build them for upstream Git first, and then we benefit from them, and we don't have to build them twice. And we don't have to do something special that's only for our internal teams, which, again, once they learn that thing, it's different from what everyone else is doing, and we'd have the same problem again. So right now, the things Office depends on are sparse checkout and, yes, the GVFS protocol, but to them, you can just call it partial clone and it's going to be the same from their perspective. In fact, the way we've integrated it for them is that we've gone underneath the partial-clone machinery from upstream Git and just taught it to speak the GVFS protocol. So we're much more aligned, and because we know things are working for Office, upstream Git is much more suited to handle this kind of scale.

Utsav Shah: That makes a ton of sense, and given that, it seems like the community wanted you to contribute these features back, and that's just so refreshing. You want to help someone out; I don't know if you've heard those stories of people trying to contribute to Git: Facebook has this famous story of trying to contribute to Git a long time ago, not being successful, and choosing to go with Mercurial. I'm happy to see that we could finally add all of these nice things to Git.

Derek Stolee: And I should give credit to the maintainer, Junio Hamano, and to people who are now my colleagues at GitHub, like Jeff King (Peff), and also other Git contributors at companies like Google, who took time out of their day to help us learn what it's like to be a Git contributor. And not just open source in general: merging pull requests on GitHub is a completely different thing from working on the Git mailing list and contributing patch sets via email.
Learning how to do that, given that the level of quality expected is so high: how do we navigate that space as new contributors who have a lot of ideas and are motivated to do good work? We needed to get over the hump of entering this community and establishing ourselves as good citizens trying to do the right thing.

Utsav Shah: And maybe one more selfish question from my side. One thing that I think Git could use is some kind of obliterate system, where today, if somebody checks PII into the main branch of a repository, from my understanding it's extremely hard to get rid of it without doing a full rewrite; some kind of plugin for companies where they can rewrite or hide things on servers. Does GitHub have something like that?

Derek Stolee: I'm not aware of anything on the GitHub or Microsoft side for that. We generally try to avoid it with pre-receive hooks: when you push, we'll reject it for some reason if we can; otherwise, it's on you to clean up the data. Part of that is because we want to make sure we're maintaining repositories that are still valid and not missing objects. I know that Google's source-control tool Gerrit has a way to obliterate these objects, and I'm not exactly sure how it works when Git clients fetch and clone: they'll say, I don't have this object, and complain; I don't know how they get around that. With the distributed nature of Git, it's hard to say the Git project should take on something like that, because it centralizes things to such a degree; you have to say, yes, you didn't send me all the objects you said you were going to, but I'll trust you to do that anyway. That trust boundary is something Git is cautious about violating.

Utsav Shah: Yes, that makes sense. And now, on to the non-selfish questions: maybe you can walk listeners through why Git needs a Bloom filter internally?

Derek Stolee: Sure. So, think about commit history, specifically when, say, you're in a Java repo, a repo that uses the Java programming language, where your directory structure mimics your namespace. So, to get to your code, you go down five directories before you find your code file. In Git, that's represented as: I have my commit; then I have my root tree, which describes the root of my working directory; then for each of those directories I have another tree object, and another tree object, until finally I reach my file. So when I want to do a history query, say, what changed this file, I go to my first commit and say: let's compare it to its parent, starting with the root trees. Well, they're different; okay, let me open them up, find which tree object each has at the first portion of the path, and see if those are different. They're different, so let me keep going; and you go all the way down these five levels, and you've opened up ten trees in this diff to parse. If those trees are big, that's expensive to do. And at the end, you might find out: wait a minute, the blobs are identical way down here, but I had to do all that work to find out. Now multiply that by a million: for a file that was changed ten times in a history of a million commits, you have to do a ton of work to parse all of those trees.
So, the Bloom filters come in as a way to say: can we guarantee, most of the time, that a given commit did not change that path? We expect that most commits did not change the path you're looking for. What we do is inject the data into the commit-graph file, because that gives us a quick way to index it: I'm at a commit at a position in the commit-graph file, and I can find where its Bloom filter data is. The Bloom filter stores which paths were changed by that commit, and a Bloom filter is what's called a probabilistic data structure. It doesn't literally list those paths, which would be expensive: if I actually listed every single path that changed at every commit, I'd have that sort of quadratic growth again, and the data would be gigabytes even for a small repo. With the Bloom filter, I only need about ten bits per path, so it's compact. The thing we sacrifice is that sometimes it says yes for a path where the answer is no; but the critical thing is that if it says no, you can be sure the answer is no, and its false-positive rate is 2% at the compression settings we're using. So, thinking about the history of my million commits: for 98% of them, the Bloom filter will say, no, this commit didn't change that path, so I can immediately move on to its parent without parsing any trees; this commit isn't important. For the other 2%, I still have to go and parse the trees, and the ten commits that did change the file will say yes, so I'll parse them and get the right answer. But we've significantly reduced the amount of work we had to do to answer that query. And it's important when you're in these big monorepos, because you have so many commits that didn't touch the file; you need to be able to skip past them.
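A minimal Bloom filter to make the "yes means probably, no means definitely" behavior concrete (illustrative Python; Git's actual changed-path filters use different hashing and sizing parameters):

```python
import hashlib

class BloomFilter:
    """k hash probes into an m-bit array: 'no' is definite,
    'yes' only means 'probably'."""
    def __init__(self, m_bits=256, k=7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, item):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._probes(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(item))

# One filter per commit, keyed by the paths that commit touched:
bf = BloomFilter()
bf.add("src/main/java/com/example/App.java")
print(bf.maybe_contains("src/main/java/com/example/App.java"))  # True
print(bf.maybe_contains("docs/readme.md"))  # almost surely False: skip commit
```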
Utsav Shah: At what point, at what number of files in a repository (because for file sizes, as you mentioned, you can just use LFS; it's the number of files that's the problem), do I have to start thinking about using these Git features like sparse checkout and the commit-graph? Have you noticed a tipping point like that?

Derek Stolee: Yes, there are some tipping points, but it's all about whether you can take advantage of the different features. To start, I can tell you that if you have a recent version of Git, from the last year or so, you can go to whatever repository you want and run "git maintenance start". Just do that in every [inaudible 52:48] of moderate size, and that's going to enable background maintenance. It turns off auto-GC, because it runs maintenance on a regular schedule; it does things like fetch for you in the background, so that when you run git fetch, it just updates the refs and is really fast; and it also keeps your commit-graph up to date. Now, by default, it doesn't compute the Bloom filters, because those are an extra data sink, and most clients don't need them: you're not doing the deep queries that you need to do at web scale, like the GitHub server. The GitHub server does generate those Bloom filters, so when you do a file-history query on GitHub, it's fast. But it does give you the commit-graph, so you can do things like git log with the graph view quickly; the topological sorting that requires can use the generation numbers to be quick. Before, it would take six seconds just to show ten commits, because it had to walk all of them; now you get that for free. So whatever size your repo is, you can just run that command and you're good to go; it's the only time you have to think about it. Run it once, and your posture is going to be good for a long time. The next level, I would say, is: can I reduce the amount of data I download during my clones and fetches? That's partial clone, and the flavor I prefer is the blobless clone: you run "git clone --filter=blob:none". I know it's complicated, but it's what we have, and it just says: filter out all the blobs, and give me the commits and trees that are reachable from the refs. When I do a checkout, or when I do a history query, I'll download the blobs I need on demand. So don't just get on a plane and try to do checkouts and expect it to work; that's the one thing you have to understand. But as long as you have a network connection relatively frequently, you can operate as if it's a normal Git repo, and that can make your fetch and clone times fast and your disk space a lot smaller. That's the next level of boosting your scale, and it works a lot like LFS: LFS says, I'm only going to pull down these big LFS objects when you do a checkout, but it uses a different mechanism; here, it's your regular Git blobs. And then the next level is: okay, I'm only getting the blobs I need, but can I use even fewer? This is the idea of using sparse checkout to scope your working directory down. I like to say that beyond 100,000 files is where you can start thinking about using it; I start seeing Git chug along when you get to 100,000 to 200,000 files. So if you can at least max out at that level, preferably less, that would be great; sparse checkout is a way to do it. The issue right now is that you need a connection between your build system and sparse checkout, to say: I work in this part of the code; what files do I need? If that's relatively stable, and you can identify that all the web services are in this directory, that's all I care about, and all the client code is over there, I don't need it, then a static sparse checkout will work: you just run "git sparse-checkout set" with whatever directories you need, and you're good to go. The issue is if you want to be precise and say, I'm only going to get the one project I need; but it depends on these other directories, and those dependencies might change, and their dependencies might change; that's when you need to build that connection. Office has a tool they call Scooper that connects their project-dependency system to sparse checkout and does that for them automatically. But if your dependencies are relatively stable, you can manually run git sparse-checkout, and that's going to greatly reduce the size of your working directory, which means Git is doing less when it runs checkout, and that can help a lot.
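In its cone mode, sparse checkout matches whole directories rather than arbitrary patterns; a sketch of the matching rule (illustrative Python, our own simplification, not Git's implementation):

```python
def in_cone(path, cones):
    """Cone-mode-style match: keep a file if it sits inside one of the
    recursively included directories; files at the repo root always stay."""
    if "/" not in path:
        return True
    return any(path.startswith(c.rstrip("/") + "/") for c in cones)

cones = ["word/", "shared/"]
print(in_cone("word/render/draw.c", cones))  # True: inside the cone
print(in_cone("excel/calc.c", cones))        # False: left out of the worktree
```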
Utsav Shah: That's a great incentive for developers to keep code clean and modular, so you're not checking out the world, and eventually it's going to help you in all these different ways. And maybe for a final question here: what are you working on right now? What should we be excited about in the next few versions of Git?

Derek Stolee: I'm working on a project this whole calendar year, and I'm not going to be done with it when the calendar year is done, called the sparse index. It's related to sparse checkout, but it's about dealing with the index file. The index file: if you go into your Git repository, look at .git/index. That file is a copy of what Git thinks should be at HEAD and what it thinks is in your working directory. So when it does a git status, it has walked all those files and recorded the last time each was modified, or when it expected it was modified, and any difference between the index and what's actually in your working tree is something Git needs to do work to sync up. Normally, this is just fast; it's not that big. But when you have millions of files, every single file at HEAD has an entry in the index. Even worse, if you have a sparse checkout: even if you have only 100,000 of those 2 million files in your working directory, the index itself has 2 million entries in it, just with most of them marked with what's called the skip-worktree bit, which says: don't write this file to disk. So for the Office monorepo, this file is 180 megabytes, which means that every single git status needs to read 180 megabytes from disk. And with FS Monitor going on, it has to rewrite the index to store the latest FS Monitor token, so it has to write those 180 megabytes back to disk. So it takes five seconds to run a git status, even though it didn't say much; you just have to load this thing up and write it back down. The sparse index says: because we're using sparse checkout in a specific way called cone mode, which is directory-based, not file-pattern-based, we can say: once I get to a certain directory, I know that none of the files inside it matter. So let's store that directory and its tree object in the index instead. It's a kind of placeholder: I could recover all the data and all the files that would be in this directory by parsing trees, but I don't want them in my index; there's no reason for that. I'm not manipulating those files when I run git add; I'm not manipulating them when I do git commit. And even if I do a git checkout, I don't care; I just want to replace that tree with whatever I'm checking out, whatever it thinks the tree should be. It doesn't matter for the work I'm doing. For a typical developer in the Office monorepo, this reduces the index size to 10 megabytes. It's a huge shrinking of the size, and it's unlocking so much potential in terms of performance: our git status times are now 300 milliseconds on Windows, and on Linux and macOS, which are also platforms we support for the Office monorepo, it's even faster. So that's what I'm working on.
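A toy contrast between the two index shapes (illustrative Python data, not Git's on-disk format):

```python
# Full index: one entry per file at HEAD, even outside the sparse cone.
full_index = [
    ("word/main.c",  "blob-oid-1"),
    ("excel/calc.c", "blob-oid-2"),   # skip-worktree, yet still listed
    ("excel/gui.c",  "blob-oid-3"),   # ... imagine millions more
]

# Sparse index: one tree entry stands in for each out-of-cone directory.
sparse_index = [
    ("word/main.c", "blob-oid-1"),
    ("excel/",      "tree-oid-7"),    # placeholder; expandable by parsing trees
]
print(len(full_index), "entries vs", len(sparse_index))
```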
The issue is that there are a lot of things in Git that care about the index, and they explore it as a flat array of entries, always expecting those entries to be filenames. So all these places in the Git codebase need to be updated to say: well, what happens if I have a directory here? What should I do? The idea of the sparse index format has already been released in two versions of Git, along with some protections that say: if I have a sparse index on disk, but I'm in a command that hasn't been integrated, let me parse those trees to expand it to a full index before I continue. And at the end, I'll write a sparse index instead of writing a full index. What we've been going through is integrating the other commands: we've got things like status, add, commit, and checkout integrated, and we've got more on the way, like merge, cherry-pick, and rebase. These all need different special care to make them work, but it's unlocking this idea that when you're in the Office monorepo, after this is done, and you're working on a small slice of the repo, it's going to feel like a small repo. And that is going to feel awesome. I'm just so excited for developers to be able to experience that. We have a few more integrations we want to get in there, so that we can release it and feel confident that users are going to be happy. The issue being that expanding to a full index is more expensive than just reading the 180 megabytes from disk: if I already have the data in that format, reading it is faster than parsing trees to reconstruct it. So we want to make sure we have enough integrations that most scenarios users hit are a lot faster, and only a few that they use occasionally get a little slower. Once we have that, we can be very confident that developers are going to be excited about the experience.

Utsav Shah: That sounds amazing. The index already has so many features, like the split index and the shared index; I still remember opening a Git index in vim and it just telling me it's the wrong format. This is great. Do you think, at some point, if you had all the time and a team of 100 people, you'd want to rewrite Git in a way that was aware of all these different features, layered so that the different commands did not have to think about these operations individually, with Git presenting a view of the index rather than having every command deal with these things on its own?

Derek Stolee: I think the index, because it's a sorted list of files, and people want to do things like replace a few entries or scan them in a certain order, would benefit from being replaced by some sort of database; even SQLite would be enough. People have brought that idea up, but this idea of a flat array of in-memory entries is so ingrained in the Git codebase that it's just not possible. Doing the work to layer on top an API that allows compatibility between the flat layer and something like SQLite is just not feasible; it would disrupt users, it would probably never get done, and it would just cause bugs. So I don't think that's a realistic thing to do, but I think if we were to redesign it from scratch, and we weren't in a rush to get something out fast, we would be able to take that approach. For instance, with the sparse index: if I update one file, afterwards we still write the whole index; it's just that it's smaller now. If I had something like a database, we could just replace that one entry in the database, and that would be a better operation; it's just not built for that right now.

Utsav Shah: Okay. And if you had one thing that you could change about Git's architecture, the code architecture, what would you change?

Derek Stolee: I think there are some areas where we could add some pluggability, which would be great. The code structure is flat; most of the files are just C files in the root directory, and it would be nice if they were componentized a little bit better,
with API layers in place, so we could do things like swap out how refs are stored more easily, or swap out how the objects are stored, and be less coupled across the built-ins and other parts. But I think the Git project is extremely successful given its rather humble beginnings. It started as Linus Torvalds creating a version control system for the Linux kernel in a couple of weekends, or however long a break he took to do that. Then people just got excited about it and started contributing, and you can tell, looking at the commit messages from 2005 and 2006, that this was the Wild West: people were fast to replace code and build new things. And it didn't take very long; definitely by 2010 or 2011, the Git codebase was much more solid in its form and composition, and the expectations that contributors write good commit messages and make small changes here and there had already been established a decade ago. So Git is solid software at this point, and it's very mature, so making big, drastic changes is hard to do. But I'm not going to fault it for that at all; it's good to be able to operate slowly and methodically to build and improve something that's used by millions of people. You've just got to treat it with the respect and care it deserves.

Utsav Shah: With software today, you run into bugs in so many different things, but Git is something that pretty much all developers use the most, and you don't even think of Git as having bugs. You think, okay, I messed up using Git; you don't think that Git did something wrong. If it turned out that Git had all sorts of bugs that people ran into, I don't even know what their experience would be like; they'd just get frustrated and stop programming or something. But yes, thank you for being a guest. I think I learned a lot on this show, I hope listeners appreciate it as well, and thank you for being a guest.

Derek Stolee: Thank you so much; it was great to have these chats. I'm always happy to talk about Git, especially at scale; it's been the thing I've been focusing on for the last five years, and I'm happy to share the love.

Utsav Shah: I might ask you for another episode in a year or so, once sparse indexes are out.

Derek Stolee: Excellent. Yeah, I'm sure we'll have lots of new features and directions to talk about.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Maju Kuruvilla is the CTO and COO of Bolt, a startup that offers quick online checkout technology to retailers. Previously, he was VP and GM at Amazon Global Mile, in charge of Amazon's global logistics and Amazon Prime fulfillment operations, amongst other things.

Apple Podcasts | Spotify | Google Podcasts

Highlights

00:30 - What does a VP at Amazon even do? The day-to-day experience of a VP/GM at Amazon. I think I've asked enough people this question that I finally have a vague sense of what these engineering leaders do (I think).

04:00 - Managing global logistics in one of the world's largest logistics companies in the middle of a pandemic

09:00 - Shipping software quickly when you're a large company. Two-pizza teams with a twist.

16:00 - The role of software in global logistics. Amazon's epic migration off Oracle databases. How to get thousands of people interested in migration or similar work.

25:00 - Launching Amazon Prime Now in 111 days (21 days more than what Jeff Bezos mandated).

38:00 - The complexity behind a checkout operation in an online store. Tax operations, compliance (!), and other complexities.

46:00 - A tech stack to solve the checkout problem.

51:00 - Building trust, relationships, and making an impact as an engineering leader in a new company. Everyone wants to hire great people, but what does that really mean?

Transcript

Utsav Shah: Welcome to another episode of the Software at Scale Podcast. Joining me today is Maju Kuruvilla, who is the CTO and now CEO of Bolt. Previously, he was VP of Global Mile, an organization at Amazon in charge of global fulfillment. Thank you for joining me.

Maju Kuruvilla: Thanks for inviting me, Utsav. It's great to be here.

Utsav Shah: Maybe we can start with: what exactly does a VP of Global Mile at Amazon do? How many people ultimately report to you, and what did your day-to-day look like at that time?

Maju Kuruvilla: I had a few different roles at Amazon; the last was VP of Global Mile. Before that, I was VP of the Worldwide Fulfillment technology team, and the difference is that in the Worldwide Fulfillment role we were responsible for all the tech and products Amazon uses in its fulfillment centers worldwide — a global responsibility. Then I moved to Global Mile, which was more of a general manager role, where I was responsible not just for the technology and products but also the operations, the sales — all the components of an end-to-end business. Both roles were global, with thousands of engineers and a lot of product managers, and when I was managing the Worldwide Fulfillment team there were even hardware teams and networking teams involved. The difference in Global Mile was being responsible for a P&L and an entire business, which is a little different from the Global Fulfillment side. I'm happy to go into either, but I just wanted to put the difference out there.

Utsav Shah: Maybe you can explain both of those — I'm sure you're looking at a P&L and then going into the specifics to understand what's going on. What does your day-to-day look like? I don't know if you can talk about any specific projects you did; I think that would be super interesting to hear about.
Maju Kuruvilla: When I took on the Global Mile role — first of all, what Global Mile is: it used to be called Global Logistics, and it's all the logistics involved in connecting countries. Whenever an item moves between countries and has to cross a border, that's where global logistics, or Global Mile, comes into the picture. Most of the items we sell in the US or Europe come from manufacturing hubs in Asia, like China, so bringing those items from there to the US is part of the global logistics role. I took on this role right before the pandemic hit. The pandemic initially hit China, where I was responsible for running our operations at the time, and then it spread to the Western countries and the rest of the world. Running global logistics during that time was very challenging and exciting at the same time. A lot of things got disrupted that even today have not been restored to what they were before — the global supply chain is still recovering from the problems that have happened since the beginning of COVID. So after I took on the role, it was largely catch-up: what do we do with the China operations? How do we manage our people there? How do we get operations up and running? How do we keep our people safe through this? Then, when passenger flights stopped flying, the majority of the cargo space was no longer available, so we had to figure out how to create more air capacity between countries. And there was this big backlog of things coming from China to all over the world — all the way from the challenges at the ports in China to the Port of Los Angeles, where you could see lines of ships waiting to get unloaded. Just managing all of that process [5:00], reinventing, and figuring out how to solve the global logistics problem in the middle of the pandemic was the highlight of my time at Global Logistics.

Utsav Shah: Sounds like a fun onboarding project.

Maju Kuruvilla: It's hard to get trained for global logistics anywhere, because Amazon does so much of it. On top of that, dealing with it during a pandemic was certainly something I was not ready for, but as with every challenge and every job, you just have to figure things out. And it created a lot of great opportunities for innovation: we did a lot of things we wouldn't have done at that speed if we hadn't been presented with a constraint. A lot of innovations and new capabilities came out of it, and it was great to have a strong team at Amazon where everyone responded fast, started building fast, and got [unclear 06:08] operational in a very short time.

Utsav Shah: Are you at liberty to share any of those innovations?

Maju Kuruvilla: A few I can certainly share. One is the air capacity problem: around 45% of air cargo capacity comes from belly cargo on passenger flights. So when passenger flights stopped flying, you essentially lost 45% of the global air cargo capacity — just completely gone. So how do we recreate that? We had to start thinking about running our own charters.
So we started renting planes — 747s — to fly from China to the US and Europe, and from the US to export countries. We started leasing planes and flying them, which we had never done before; that was something we created last minute. An interesting story there: even leasing cargo planes became very expensive, because everyone was trying to do something similar, so we started renting VIP jets. At one point we rented a VIP jet fancier than anything I've ever flown on, and we were putting packages on it and shipping them all over the world. Creating that kind of air capacity in a very short time, and then building it into a framework that could be used longer-term, was something amazing we did quickly — and whenever you have to do that, a lot of things need to come together. One, you need the right technology to organize all the different products at the source, figure out the route everything needs to take, and then work out how to fill the capacity on the airplanes. You load something called a ULD, which is a container you fill with all the items — that's what gets loaded and unloaded. So you have this complex math problem of how to fill it, because you want to mix the right amount of weight and cube to maximize utilization, and you have to come up with those algorithms quickly. Then there's safety: making sure the things you don't want on a plane don't get on a plane. And the same at the destination: how do we unload it, and from there, how does it flow into all the different distribution centers? So: creating the technology for all of that end to end, building the operational process so we could run it end to end, and making sure the business was ready to run through all of it. Creating all of that in a matter of a few weeks is the speed and scale we sometimes have to run at.

Utsav Shah: First of all, I didn't even know that 45% of all air cargo capacity rides on passenger flights — that fact itself blew my mind. But it's also interesting that there's so much software involved in shipping something like this. One thing that has always fascinated me is that even though Amazon is such a large company, it can operate and ship things this fast. Maybe it's a secret sauce — how does Amazon get that done? We've all heard about the two-pizza teams; is there something else you've seen that's a super important part of the culture?

Maju Kuruvilla: One is what you just mentioned, the two-pizza team, but the other aspect that I think allows Amazon to move [10:00] fast is its decision-making capability. Whenever we have to make a decision at Amazon, we call it a one-way door or a two-way door decision. A one-way door decision is one where there is no going back, so you have to be extremely thoughtful about whether you make that decision and walk through that door. A two-way door decision is one you can make and then, if you don't like it, you can always walk back. For us to move fast, we have to treat as many decisions as possible as two-way door decisions, and then allow the teams and the people involved to make them, because it's okay if they make a mistake — they can always walk back.
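An aside on the ULD "weight and cube" problem Maju describes above: it's a variant of bin packing with two constraints. Here's a toy greedy sketch with made-up capacity numbers — not Amazon's actual algorithm, just an illustration of the shape of the problem:

```python
# Toy greedy loader for ULDs constrained by both weight (kg) and cube (m^3).
# First-fit-decreasing, ordering items by the tighter of their two
# normalized dimensions so the hardest-to-place items go first.
MAX_WEIGHT, MAX_CUBE = 1500.0, 4.3  # illustrative ULD limits

def pack(items):  # items: list of (name, weight, cube)
    ulds = []  # each ULD: {"items": [...], "weight": float, "cube": float}
    order = sorted(items,
                   key=lambda i: max(i[1] / MAX_WEIGHT, i[2] / MAX_CUBE),
                   reverse=True)
    for name, w, c in order:
        for uld in ulds:
            if uld["weight"] + w <= MAX_WEIGHT and uld["cube"] + c <= MAX_CUBE:
                uld["items"].append(name)
                uld["weight"] += w
                uld["cube"] += c
                break
        else:  # no open ULD fits: start a new one
            ulds.append({"items": [name], "weight": w, "cube": c})
    return ulds

print(pack([("pallet-a", 900, 1.2), ("pallet-b", 700, 3.0), ("pallet-c", 400, 1.0)]))
# -> two ULDs: one dense-and-heavy, one bulky-and-light
```

The real problem adds routing, hazmat rules, and arrival-time priorities, but the core tension — weight-limited versus volume-limited loads — is what the "mix weight and cube" remark is about.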
And decentralizing that decision-making — enabling people to make those decisions, and giving them a framework where most decisions are two-way door decisions they can walk back — creates very fast decision-making. Enable people to decide fast, then let them verify whether it's working so they can walk back if it isn't. That kind of decentralized enablement is critical for companies to move fast, and Amazon takes advantage of it quite a bit: people are not afraid to make decisions, and people are not afraid to fail. The culture is that you can make a decision and learn from it, and if something is wrong, you walk it back; as long as you can do that properly, that's great. That is where I feel most companies get stuck — nobody knows who will make a decision, everyone keeps escalating the decision to a higher level, and then somebody who is not very close to the action is trying to make it, and even if it's a very good decision, it just takes a long time. So we sometimes say a wrong decision made fast might be better than the right decision made very slowly, as long as it's a two-way door.

Utsav Shah: Yes, that makes sense. Can you walk through the example of renting airplanes — who would make that decision? Would a VP or a GM decide that we have to rent this airplane? How would that bubble down? Just to walk listeners through a project like that: who would be making decisions, and at what level?

Maju Kuruvilla: Again, these are not codified anywhere per se, and whenever you make a decision, one part is making it; the other is notifying people — this is what we are doing, and these will be the implications. In this scenario, it was brought up to me by the team itself. They said, capacity is down, and we've got to come up with some new ideas, and they started coming up with options. The challenge was that whenever you create capacity in any supply chain, there is a bit of a chicken-and-egg problem: do you create capacity first, or demand first? When you build a supply chain, most of the time you have to create capacity first, because if you create demand first, the demand just waits and you create a terrible experience for people. In this particular case, we didn't have enough demand to charter our own planes — but if we built it, created an infrastructure around it, and made it reliable, would more people use it? So it was a decision the team bubbled up, and it largely fell to me to make the call at that point. I made a very strong recommendation to our leadership, they said go for it, the decision was made in less than six hours, and off we went. Once we'd decided we were going to build the capacity and figure this out, it was all about execution. And that's only a small part of the decision-making; then there's the decision-making that happens every single day: how many planes do we need, which days do we need to run? At the source and the destination, do you have the right operations aligned, and on which days do you not want a plane arriving and waiting somewhere?
So there's a lot of decision-making that happens at that level. And then: what do you put on the plane, how do you prioritize? Do you prioritize an Amazon retail item? Do you [15:00] prioritize a seller's item? Do we prioritize protective equipment for our associates? We even transported equipment for hospitals and others, just so we could help in the middle of the pandemic. So decision-making happens at all levels, and the key is that it's not about one big decision — when you're working in a very fast-paced environment like this, there's a lot of micro-level decision-making. If anybody hesitates to decide because they feel like somebody is going to beat them up, or they don't know exactly what it means, or they don't have all the answers, then you can't move fast. But if everybody knows it's okay — I can make this call, everybody's going to understand, and if I got something wrong, that's a great learning experience for me — then you can move fast. So it's not about decision-making at a particular level or of a particular kind; it's about enabling a culture that frees people from the fear of failure and lets them focus on: what are we trying to do, and how do we achieve it fast? It's that culture of decision-making that lets you move fast.

Utsav Shah: That's interesting to hear. Can you apply this to a large software project you all did, so it's more relatable to people who are used to shipping a large piece of software?

Maju Kuruvilla: Quite a few examples come to mind, but one is a very complex and very technical project, even though the outcome is fairly simple. Amazon had built its entire software stack on the Oracle database platform over decades, and we started running into constraints on Oracle — scaling issues, because Oracle scales vertically and Amazon wanted to scale horizontally. We wanted to move from vertically scaling systems to horizontally scalable ones. The bottom line is: how do we move off Oracle and onto different platforms — whether DynamoDB or Aurora, some kind of horizontally scalable solution? All of Amazon had to go through this, but for fulfillment, one of the earliest teams at Amazon, it was a big ordeal. How do we change the database when, first of all, fulfillment cannot stop — every day you're fulfilling millions of units and everything needs to keep moving fast — while on the other side, you're changing the database you've relied on for decades? And this is not a small load; you're applying heavy load to completely new technology. It was an all-in-one database: everybody had their tables, with dependencies across all those tables. How do we make it all happen in some kind of sequence, and get it all done in one year? This is one of those projects we thought was impossible, and then we said: alright, let's go after it, because our leadership all agreed this was the right thing to do for the company.
Some teams were able to take their data and find new destinations for it, and that was fine — the simple cases, maybe 20% of the use cases, all unlocked that way. For everybody else, there were a ton of features needed: some needed transactional support, depending on the ACID properties of a database for whatever feature they were building. There were also interdependencies — if this service moved, that service also needed to move, because they shared the same data and you couldn't shift them separately. This is a project where I spent a lot of time, a whole year, working with every single team — and you have to imagine, I had a thousand-plus engineers, a few hundred individual teams, and hundreds of services that needed to move off of this. So this is, again, one [20:00] where we decentralized decision-making: we told people what needed to be done, said you can go ahead and get your pieces done, and if you cannot, come back with a proposal for how we collaborate between teams. A lot of the time there were decisions to be made: do we go from a relational database like Oracle to a completely NoSQL database, or from one relational database to another, like Postgres or an Aurora kind of data store? How do we manage that whole journey? Again, every team had to make hundreds and thousands of micro-decisions on their side, and they had to figure out how to collaborate across all the people involved. And there were times when some of those things were not happening — there was no plan. I still remember, one of the Distinguished Engineers on the team and I had to tell people: if something is not going to work, you need to tell us right now, you need to raise your hand, because if you keep assuming it's going to work and it fails at the last minute, we won't be able to help. You need to ask early enough that we can help — and there were a few instances where we did have to go and help, and a few where we had to innovate and come up with completely new technology to solve the problem. That was the experience of moving the entire worldwide fulfillment stack at Amazon, which had been running on Oracle for 20 years or so, completely off Oracle to a new platform in one year. We got it done — it was no small feat — and we got it to run and scale just fine through peak, and it was a huge outcome for the company. A lot of other companies would struggle to achieve something like that in such a short time.

Utsav Shah: That sounds like a hard migration. As you said, 20% of use cases are easy; then half will probably get done, and there will be so many stragglers with special cases and one-offs — it just sounds extremely painful to drive. How do you set up incentives the right way so that people want to do it? Because I'm sure a lot of it is grunt work, and it's a bit of an ordeal, as you said. One thing you mentioned was having a Distinguished Engineer go and help everyone, but people have so many competing priorities — they want to ship new things. How do you make it easier for these teams to prioritize this work?
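An aside on the migration mechanics above: the episode doesn't describe how individual services cut over, but a common pattern for this kind of database move is dual-writing with a backfill and verification step. A minimal sketch, with hypothetical store interfaces:

```python
# Hypothetical dual-write wrapper for migrating from an old store
# (e.g., Oracle) to a new one (e.g., DynamoDB/Aurora). This is a generic
# cutover pattern, not necessarily what Amazon did.
class DualWriteStore:
    def __init__(self, old_store, new_store, read_from_new=False):
        self.old, self.new = old_store, new_store
        self.read_from_new = read_from_new  # flip once backfill is verified

    def put(self, key, value):
        self.old.put(key, value)      # old store stays the source of truth
        try:
            self.new.put(key, value)  # best-effort shadow write
        except Exception:
            pass  # record for later reconciliation; never fail the live path

    def get(self, key):
        if self.read_from_new:
            return self.new.get(key)
        return self.old.get(key)
```

Once shadow writes plus a one-time backfill bring the new store in sync, and read comparisons show no drift, reads flip to the new store and the old one can be retired — which is one way the "raise your hand early" discipline pays off: drift surfaces long before the final cutover.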
Maju Kuruvilla: It's all about prioritization. You can say something is important, but people pay attention to where you spend your time. If people see that I, as a leader — or anybody who is a leader — am spending most of my time on a project, they know it's important; it's not just being said, it's being done. And then you provide structure and guidance. For example, we created a small tiger team that knew how to audit every single team's plan — not just their migration plan but also their peak-readiness plan — and they would give me a report I could review. If a team wasn't ready, even if they thought they were, the auditing team would come back with that. There were different mechanisms we put in place to enable teams to do this. Now, that example is the hard grunt work of migration; let me switch to a different example, the launch of Prime Now, which is more exciting new work. Amazon wanted to get into fast delivery, and a big plan was created for how to do it. When Jeff Bezos reviewed the plan, he said: well, this is great, I love it — go make it happen in 90 days. When you're Jeff Bezos, you get to say things like that: go build an entirely new experience that gets people things within one hour, in 90 days. This was when same-day delivery wasn't even a thing — one-hour delivery wasn't a thing, and nobody knew what it meant. I was responsible for the fulfillment aspect of it, which turned out to be very complex: [25:00] how do we deliver something in less than an hour when our fulfillment systems are designed to deliver in two days or less? So again we assembled a core team, divvied up the functions, and decided to take some components of fulfillment and build a lighter version of our fulfillment stack. Some teams built the front end — the new app, the connectivity. How do we manage payments? How do we manage fraud? And, in the end, how do we create technology inside the fulfillment center so you can pick and box all these things lightning-fast? Then we launched it in New York, right in the middle of the city, during Christmas. I was there — we'd worked hard on it — and I did deliveries; we called it rideshare, where you go along with the people delivering and see the experience from the customer's side, and I did that for some of the initial orders. It was fascinating to watch people's reactions. Sometimes we delivered things within eight minutes of the customer first pressing the order button, and you could just see the stars in customers' eyes when something showed up within eight minutes of ordering — and mind you, back in the day, these things didn't exist. It was a magical experience for everyone. The collective team — not just mine — got it done in 111 days from the Bezos meeting; even though he'd said 90 days, we got it done in 111, I don't even know how we managed even that, and Bezos was still happy. That's the speed at which you move, and when you move at that kind of speed, you are innovating on so many things and making decisions on so many items.
But it's not one person making all of the decisions — it's the entire team making those decisions, which allows everyone to move faster.

Utsav Shah: From reading one of the books on Amazon and hearing your stories, it seems the idea is to fund each level so it has enough people, so that headcount isn't the blocker, and then make sure there is fast decision-making and accountability so projects get done on time — rather than underfunding each level so it takes forever to get anything done because teams are stretched too thin.

Maju Kuruvilla: Whenever you have a big problem, it's very hard to solve it as-is. Number one is: let's agree on the problem — make sure solving it makes sense for everybody, that it's the right thing for the company and the customer, and that we have the right know-how to make it happen. Once we do that, the next question is how to divvy the problem up into small chunks. If you want to move a mountain, it's extremely hard to move it whole; but if everybody takes a piece of rock from it, and you can assemble thousands or millions of people, you can move the mountain. That's the concept of swarming a group of people around a huge problem: not attacking the whole problem as one, but dividing it into smaller components. Now, divvying up the components is an art. It's not just "I'll do this part, you do that part" — each of the smaller components needs to be fully functional by itself, so the team knows what they're building and how to test it. Think of a car: each component is like a tire. If you build the tire properly, you can have a tire team and a wheel team, and the good thing is that the team can keep obsessing over that component and making it better every day. Today they come up with a new tread pattern, tomorrow something else — but they have built something complete in itself as a component, something they can keep [30:00] improving throughout its life. That's what we call a two-pizza team at Amazon, and what it means is that the team has ownership of a component that is complete and relevant [unclear 30:20]. It's not short-term — "as a project I'll do this piece" — it's long-term ownership: I own this component, I will continue to make it better, and I know how it fits into all the larger pieces, but I don't need to worry about them. I'm not going to worry about how the headlights work, or any of those other things; I know they're all there, but I'm not constrained by them, because I built to the right contract. If it's a tire, I just need it to fit the right wheel, and then I'm good — what happens after that isn't my concern. I'm going to obsess over this tire and all its materials forever, and we'll make it better over time. That's the concept: break the problem down into components where each team owns one piece and obsesses over it — 24/7, 365 days a year, they just try to make it better — and when a lot of components come together, you assemble them to solve the larger problem you're going after.
Utsav Shah: And I'm guessing the holistic vision of the assembly is management's responsibility, along with the senior engineers and the principal engineers who give guidance.

Maju Kuruvilla: Yes — more than management, it's the senior engineers, the principal engineers, and the distinguished engineers, because of the way Amazon is structured. You have this two-pizza team — it's called that because, supposedly, you should be able to feed the team with two pizzas — so the magic number is somewhere between seven and eleven people, and that team should have everybody needed to solve the problem of that component. Sometimes the team has hardware people, software people, data science people; it cannot be just a software team. It's a team built around what you need to solve the problem, and you need all of them in there. Whenever you have a principal engineer, their responsibility is to look across multiple components like this and see how they all come together; the senior engineers in the teams, even though they're part of a team, also look across, negotiate, and make sure all the pieces fit together well. But the rest of the team stays heads-down on what they need to build, and they all have a metric they're trying to improve — Amazon calls it a fitness function. You look at that fitness function and keep making progress on it.

Utsav Shah: And that's some quality metric indicative of whether the team's component is functioning properly or not?

Maju Kuruvilla: It's more than that — it's about the team's component being the best it can be. It's not a functional metric; a fitness function means the single most important thing you can measure to see whether the team is doing the ultimate best it can.

Utsav Shah: That's interesting. And then you can imagine some teams might even have an NPS score or something eventually tracked as part of that fitness function. And finally, management reviews all of these teams' fitness functions to make sure everything is coming together so they can deliver the final large piece, like Prime Now.

Maju Kuruvilla: And usually there is an operational plan — that happens every year — where you see how everything is going to come together. Amazon also has a heavy documentation culture; writing and reading are important. And reading is as important as writing, by the way: I have seen a lot of companies say "we have a writing culture," but most of them don't read. Writing has a network effect — if people don't read, writers have no incentive to write properly. Amazon has a very good writing and reading culture, and what that means is that whenever you have to solve, let's say, [unclear 34:43] — another product we built was a computer vision system to completely automate the inventory counting process in all fulfillment centers. It used to be a very manual [35:00] process. Before, people had to count — like closing up a store, you have to count the entire inventory, all the time, for compliance purposes, and also to make sure the virtual and the physical records match.
We came up with a completely new approach where computer vision systems constantly monitor items as robots move them across the floor, and that replaced the whole manual counting process. And again, a team of eleven people made that happen — from an idea they pitched to rolling it out across Amazon's fulfillment centers worldwide. It saved the company a lot of money and automated a lot of processes, and it was one more two-pizza team coming together. When you have to do something like that, the first thing you write is what we call a press release document, or PR FAQ document. It's one page that clearly says: what's the problem? Why do we think we're the right people to solve it? Is this the right problem to solve? And if you solve it, what will the experience be for the customer? You have to write the customer's experience after the problem is solved, before you start doing anything. That's what the press release is: writing down how the experience is going to be after the fact, before you start any work. Which is very powerful, by the way, because when you have to write it from the customer's angle, a lot of things become very clear — things you didn't think through become obvious, because you might be solving only part of a problem, or a piece of a bigger one. And a customer only cares if the whole thing works, not a piece of the result. So having to articulate that customer experience is very powerful. Then there is a sequence of FAQs — the press release itself is usually just one page, by the way, followed by a lot of questions — and that document is the first thing everybody writes and reviews. For any of these projects — Prime Now, the computer-vision-based inventory counting — the first thing people do is write the document, and then it gets reviewed by leadership. If leadership reads it and finally approves, you can start a new two-pizza team or allocate resources for it. There might be multiple two-pizza teams coming together to solve it, and you explain that in your plan; you get approval, and then people go off from that point and make it happen.

Utsav Shah: Switching gears a little bit: one of the things Amazon is good at is making sure the checkout experience is super smooth — there's the one-click option, which I think Amazon has a patent on, so it's not easy for other companies to do something similar. A lot of the reason people like using Amazon is the smooth checkout experience, plus it's super reliable. Just recently I tried to buy a Dell laptop from Dell, and twice they just canceled my order, so I had to place it through Amazon. I don't know how much Dell has to pay Amazon for it, but they just lost the commission on that laptop. Can you walk through why it's so important for checkout to be seamless? Intuitively it makes sense — you don't want people abandoning stuff in their cart — but are there any numbers, anything at all, about why the checkout experience has to be as convenient as it is?

Maju Kuruvilla: It's an interesting story, and the whole online commerce story is unfolding in front of our eyes as we speak — this is the time for e-commerce.
Companies like Amazon created this online buying experience and got people to trust buying online. Before, people were worried about the security and safety of their stuff and the quality of the things they might get; Amazon solved all of that and got people to buy online without thinking twice. What we're seeing now is that consumers are used to it — and the pandemic accelerated the whole process, pushing e-commerce adoption almost 10 years ahead of its previous pace. Without the pandemic, it would have taken another 10 years to get where we are. So now people are [40:00] buying — especially the newer generations, for whom buying online is no big deal. They don't even think twice about it. But what comes with that is that people also want a different experience. They'll continue to buy from Amazon, but they also want to buy from other places, because there are a lot of merchants and brands that want to provide a unique experience for customers. That experience is not just about buying a product — it's the whole experience of connecting yourself with the brand: the buying itself, and all the post-purchase experience of staying connected with that brand. Sometimes that relationship is more than just buying something, and more people want that, especially the newer generations. The challenge for most of those merchants is: how do you provide a simple, seamless checkout experience like Amazon's? Because people are used to it now. You don't need to beat Amazon on it, but you do need to provide a similar experience everywhere else, and this is where companies like Bolt come in. I'll speak about Bolt a little here: we provide that checkout experience people are used to — a one-click checkout where you come to a site, click one button, and it's yours. If you can provide that experience, people will engage with brands and merchants a lot more than if they have to go through a high-friction buying process. When you think about buying, it's a funnel: first there's research, then discovery of the product, then intent to buy — that's when most of the time [unclear 42:11] — and then there's the conversion. At every point you're adding friction where people can back out of the process. But if you can remove all of that — it's like when you wanted to rent a movie and had to go all the way to a Blockbuster, stand in line, pick up a video, and come home to watch it. That's a lot of friction versus sitting at home, clicking Netflix, and boom. People watch more Netflix, because of the ease of it, than they ever rented from Hollywood Video or Blockbuster [unclear 42:51], because there was a lot of friction in that — people just do more when it's convenient. The same applies to checkout: we want to eliminate all that friction so that people's intent to buy becomes their conversion — nothing stands in between. And beyond buying things on a brand's website, there is also one more step beyond that.
This is where a lot of people are moving, and it's called social commerce: people want to buy things as soon as they see them — buying at the point of discovery. As you see an advertisement, a video, an influencer's post, or a review, and you see something you want, can you just click and buy it right there with one click? Or is it a link that takes you to some site where you have to go through the whole process? Simplifying this whole journey — whether it's buying from a website or buying at the point of discovery on any surface — is going to be very critical in the future. In fact, it's going to be the expectation for a lot of people as we move deeper into e-commerce, and that's what companies like Bolt provide out of the box for merchants, so they can offer this experience to their customers.

Utsav Shah: Maybe talk a little about the impact of that funnel. If I'm a shopkeeper today who just set up my site without using Bolt or an optimized flow, what abandonment rates would I expect? I know it differs for everyone, but I'm trying to understand how much that convenience factor plays into online purchases.

Maju Kuruvilla: It's substantial. It depends on the merchant, the category, the customer [45:00] — we have several case studies listed on our website — but in some of the best cases, you see an 80% increase in conversion from a one-click checkout experience. That's on the highest end. It's very powerful for a couple of reasons. One is just the friction: you don't need to go do something else to complete the purchase. Number two is safety: people care about their safety and privacy. Do I want to give my information to every website out there? Or do I give it to one party I trust, where my identity and all my information live in one place, but it lets me log in everywhere — the single-sign-on concept? That provides more comfort and safety for customers. So one side is friction; the other is having a safe way to buy from anywhere.

Utsav Shah: You spoke earlier about the fulfillment stack — building all the technology behind fulfillment. Maybe you can walk us through the checkout stack, if you have to build a checkout system like this. At a glance, you're adding something to a cart, retrieving that cart, and making a purchase. But how does this work? How do you make it one-click? And what are some challenges? I can imagine all sorts of things to worry about — users might click by mistake, there's fraud and so on. How do you solve all of these problems?

Maju Kuruvilla: Great question. From the outside, checkout seems like a fairly simple process: you add something to the cart, there's a checkout button, you click it, it asks for payment and a few details, and it's done. But checkout is one of the most complex parts of a commerce stack, and the reason is that until checkout, it's just browsing — you're adding things here and there, and nothing needs consistency until that point. But when it comes to checkout, it's real.
The checkout system needs to check inventory, check taxes, look for coupons, look up the pricing at that moment, and pull all the shipping options — anything and everything that needs checking is done at checkout. Checkout has to call every single system the e-commerce stack has, to make sure it all comes together so you can present to the customer what it takes to buy the thing and when they can get it. So it's an extremely complex function. Behind the scenes there's the UX — how we provide a seamless user experience. At Bolt, for example, we obsess over how every single pixel in that checkout works: how it works on the website, how it works on mobile, how to optimize it so the whole process feels like a breeze to customers. Then there's the payment gateway. People may want to use a variety of payment instruments, and the payments world is changing so fast that a new or alternative payment method comes out every other week nowadays. So how do we keep providing merchants an integration into the entire payments world, and present all of it so customers can choose the right option? That's another layer of complexity. Then there's the identity of the user itself: how do we provide one click? We need to save the user's name, payment details, and all the information needed to do that. What a company like Bolt has is all of that shopper data as a network. Think of all the different shoppers connected to our shopping accounts network: everybody contributes to that network, and everybody benefits from it if they use Bolt checkout. The Bolt checkout system is built on an accounts network that keeps growing and evolving — as people shop across all these different merchants, they get added to the network. So when a brand-new merchant starts using Bolt, they get access to the whole shared network built up so far, and they can provide one-click checkout to every single user in that [50:00] network. What we're finding is a virtuous cycle: as we get more merchants, we get more accounts in the network; more accounts mean more one-click checkout transactions; and more one-click means more merchants want to sign up with us. That virtuous cycle is accelerating Bolt's growth right now. So those are the different layers, from the UI down through all the complexity around it, fundamentally built on this shared accounts network that truly powers the one-click experience for everybody.

Utsav Shah: As an end user, do I know I'm using Bolt when I'm buying something [unclear 50:46] website, or is it opaque to me?

Maju Kuruvilla: You are buying it through Bolt; however, we don't try to push it as a completely separate brand, because the merchant is our customer and we want to integrate seamlessly into the merchant's ecosystem. We want to look like we're enabling the merchant, and we want to stay out of the way between the customer and the merchant, because we want to provide an experience that's as seamless as possible.
That said, users do need to know it's Bolt so they can trust us — they see that it's powered by Bolt, they create an account with Bolt or log in with Bolt, so they know where their information is. They can control it, manage it, and be at peace that we're taking care of it in one safe place.

Utsav Shah: That makes sense. Now I'm curious: you were running divisions and organizations at large companies like Amazon before, and you can't just take every good idea you have and apply it at a much smaller company. What are some things you changed in your first few months there, and what are some things you feel extremely strongly about, even at a much smaller company? Bolt is not super small, clearly, but what are some learnings you felt you had to apply?

Maju Kuruvilla: First and foremost, I would say: it doesn't matter what size the company is, you have to be connected with the customer — every single person, all the way through. It doesn't matter whether you're an engineer or an accountant, whatever your discipline: if you're at a company and passionate about its mission, you need to be very connected with your customer. You need to sit in on customer calls, attend some support calls, be on call sometimes — be right in the thick of things. Big or small, it doesn't matter; I'm passionate about that, and I'll push to make sure everyone is deeply connected to it. Number two is hiring great people. Your company is only as good as the people you have, so hiring great people and taking care of them is the highest priority of any leader — big or small, all companies. For example, when I came in, we were growing fast and wanted to bring in a lot of great people, so I spent a lot of time on hiring and on meeting with our people. In my first 90 days, I met with every single person in the engineering and product organizations, one-on-one, and asked three simple questions: What's working well at Bolt? What's not working well? And what are you hoping I'll do to help? Three questions to every single person in engineering and product in my first 90 days — and on top of that, north of 100 hiring interviews in those same 90 days. Both sides matter: you have to take care of your great people — understand them, what they want, what's working and what isn't, so you can fix things — and you have to keep bringing great people into the company. Knowing the customer and having a great team are the two things I hold dear, no matter where I go.

Utsav Shah: That makes a lot of sense, but how do you [55:00] gauge the right people? How do you know somebody is good or not? Is there something in the interview you do, or is it just a quality you see? I'm just curious — everybody talks about hiring great people, but what does that mean?

Maju Kuruvilla: There are two aspects I personally look for when hiring great people. One is the basic table stakes — operational excellence, technical skills, and all of that — which I think almost every company looks for, and that's great.
But then there's the other side, and this is where — beyond culture — I tend to look for people who are systems thinkers: people who, whenever they hear a problem, don't just ask "how do I solve this piece?" but take a step back and ask what's going on here, and what the right way to solve it is. Maybe the solution is very different from what's obvious at the beginning. People who can take that step back have systems thinking, and what that means is truly solving the problem — sometimes not the problem in front of you, but the original problem that caused the one you're seeing right now. So during interviews I focus a lot on things like: what are some of the things they did? And I ask questions like: why did they do it that way? Why was that solution the right one for that particular problem? How did they know it would solve what they were after? Because what I'm finding is that people who can think at that system level, end to end, solve problems in more innovative ways than people who just attack individual issues one by one. I know it's not that scientific, but it's a mindset, and I've found such people tend to create much longer-lasting impact than most others.

Utsav Shah: Yes, I think that makes a lot of sense — systems thinking, framework thinking; people who can understand the problem holistically rather than just the smaller pieces. This was a lot of useful information, and we're almost at time. Thank you so much for being a guest. I certainly feel I learned a lot about a company that's pretty much a black box from the outside — I had no idea about so many things, how fulfillment works, checkout. Thank you so much.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
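To ground Maju's point that checkout is the first moment everything must be consistent — inventory, pricing, tax, coupons, shipping all at once — here is a minimal orchestration sketch. The service names and functions are hypothetical placeholders, not Bolt's actual API:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Quote:
    subtotal: float
    tax: float
    discount: float
    shipping_options: list

# Hypothetical downstream calls -- in a real stack these hit separate services.
async def check_inventory(cart): return all(item["qty"] > 0 for item in cart)
async def price(cart): return sum(i["price"] * i["qty"] for i in cart)
async def tax_for(cart, address): return 0.08 * await price(cart)  # placeholder rate
async def coupon_discount(cart, code): return 5.0 if code else 0.0
async def shipping_options(address): return [("standard", 0.0), ("express", 9.99)]

async def checkout_quote(cart, address, coupon=None):
    # Until now it was just browsing; checkout fans out to every system
    # concurrently and assembles one consistent answer for the customer.
    in_stock, subtotal, tax, discount, shipping = await asyncio.gather(
        check_inventory(cart), price(cart), tax_for(cart, address),
        coupon_discount(cart, coupon), shipping_options(address),
    )
    if not in_stock:
        raise RuntimeError("item went out of stock between browse and checkout")
    return Quote(subtotal, tax, discount, shipping)

cart = [{"sku": "sku-1", "qty": 1, "price": 42.0}]
print(asyncio.run(checkout_quote(cart, address="94103", coupon="WELCOME")))
```

A real checkout adds fraud scoring, payment authorization, and compliance checks on top, which is why the "simple" button hides one of the most complex paths in the stack.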
 
Bharat Mediratta is the first Coinbase Fellow. Previously, he was a Distinguished Engineer at Google, CTO at AltSchool, and CTO at Dropbox.

Apple Podcasts | Spotify | Google Podcasts

We focus this episode on the role of a senior technical individual contributor in a technology company and contrast that with the role of a technology executive (like a CTO) in a public company. We talk about how to explore what the right position is for someone, what drives their success, how to drive impact as a technical leader, and the trade-offs that have to be explored by both senior individual contributors and leaders.

Highlights

01:00 - The story behind switching from a CTO back to an individual contributor.

08:00 - What does a director/VP/CTO's day look like? How do they drive impact?

14:00 - If a manager has to make effective decisions and has input from individual contributors, why do they need to have a computer science background? Why do tech companies frown on hiring non-technical (MBA) leaders?

21:00 - A manager is certainly thinking about business priorities, and a senior technical individual contributor also has to think about business priorities. But how much? What's their approach in driving towards business outcomes? How much backing does a technical leader need from management in order to be effective?

28:00 - When should Heads of Engineering think about hiring senior technical individual contributors? At what stage/what shape of problems should they be dealing with?

32:00 - The initial transition from senior engineer at Google to CTO of AltSchool

47:00 - For the next person exploring a move from IC to management: how to go about exploring the move? What should they keep in mind?

Transcript

Bharat [02:00]: Yes — hey, thanks for having me back. I enjoyed our first podcast; it reminded me of so much of the fun I've had in my career doing great things, especially when I was at Google, and it's nice to be back. And yes, thanks — on the role, it may seem like an odd turn to go from being a senior executive organizational leader at a big public company to being an individual contributor. But in many ways, for me, it felt like a very natural transition, and in fact I know a lot of other people who have done this. To understand what's going on here, you have to wind the clock back a little. I spent most of my career on this kind of arc: I would join a company as an engineer, become a technical lead, then become an organizational lead of some kind, like a line manager. And then I would realize I just didn't enjoy the things I was doing on the organizational leadership side, and I would always be trying to steal time to work on the technology — partially, I think, because I was more comfortable with it, but also because it clearly gave me a lot of joy and energy. I would then get to a point at these companies where I didn't want to keep going on the trajectory I was on, but I didn't feel like it was acceptable to go back to being an individual contributor. So when I got to Google in 2004, I once again joined as an engineer, and I committed to myself that I wasn't going to become a manager or a team lead. That lasted about three years — but I am not good at ignoring a vacuum or ignoring problems, and at Google there was this amazing opportunity to take on a team that was important to the company and lead it.
So I thought, I'll try it again — maybe I've grown up, maybe things have changed — and I'd say in 2008–2009 it put me onto a ten-year trajectory of becoming a senior engineering leader. There were a lot of parts of it I was pretty good at: I'm a humanist, I think relationally, I enjoy people, I enjoy getting to know and understand them — I enjoy that aspect of organizational leadership. But there are also a large number of challenges in that role that crop up as your team gets larger and that take a lot of your time and energy. Your primary tool stops being your compiler and your editor; your primary tools become Google Slides and email and meetings. Over time, as I got better and better at being a technical organizational leader, I found I was enjoying my day-to-day less and less. At first I thought this was just [05:00] par for the course — just what you have to do to have a big impact, and I felt like I did have a big impact. But I got to the point where I realized: it used to be that I enjoyed my job so much it didn't feel like work — I was blessed to be getting paid to do something I loved and would have done for free — and I had shifted to a mode where I didn't love everything I had to do every day; in many cases there'd be entire days where I loved very little about what I did. It was important, and I felt the need and the value in it, but I was also feeling myself slowly getting more demoralized, less and less enthusiastic about the mission I was on. After I left Dropbox, I did some soul-searching. I thought: why am I doing this? I'm at a point in my career where I've achieved many of the goals I set out to achieve, and I'm just not enjoying myself as much as I would like. And why is that? I didn't feel I had to keep doing something I no longer enjoyed; I'd kind of reached the level where it's okay to not necessarily work every day. So I decided: I will go see what I enjoy doing when there is no pressure, no structure, and no money — only time. I started working on things I was passionate about: I found myself diving into finance, advising a lot of startups and startup founders, learning about financial instruments, and learning a lot more about cryptocurrency. I've been a cryptocurrency enthusiast since 2013 or so, when I bought my first Bitcoin on Coinbase. So I spent a bunch of time at the beginning of this year learning about DEXes and DeFi, and how CeFi could shift to DeFi, and ultimately I started getting attracted to the missions of some of these big cryptocurrency companies. When Coinbase reached out, the conversation with them was kind of awesome. I chatted with Manish Gupta, the VP of Engineering there, and we had a great conversation in which he wanted me to come in and be an engineering organizational leader — that's where his need is, one of his significant needs. We had this great conversation about what his calendar looks like: Manish is a very important, very influential figure at Coinbase, and his calendar is wall-to-wall craziness.
And I was like: look, Manish, that's not what I'm looking for right now. He's a very smart guy, so he asked, well, what are you looking for? And what I told him was: I want to show up and learn from some of the best people in the industry who've been investing in this space, and I want to understand how Coinbase is strategically increasing economic freedom. I want to grow my skill set, then find some hard problems at Coinbase and apply myself there — but without the day-to-day burden of organizational leadership, the day-to-day burden of a stack of meetings. I want to go back to my roots as a technologist and find some big problems, and where better to do it than the company at the forefront of how these crypto exchanges work, and one that's so innovative. So together Manish and I hammered out this role and the role description, which for me feels like something that will bring me joy every day and will give me time and space to investigate. And it's funny — I've been talking with lots of CTOs and SVPs over the last few months as I head into this role, and every single one of them, when I tell them what I'm doing, gets that look on their face: wow, that would be so nice. They'll say, it would be so great to lay down the burden of what I'm doing. I was just talking to a guy yesterday who said, wow, that sounds like so much fun. Because when you're in it, running a big org, it's big and it's meaningful, but it's also so strategic that it's hard to accomplish anything on a day-to-day basis. It's hard to feel that tactical win, hard to be close to the technology; you're getting pulled so much into corporate strategy and the organizational leadership aspects of it. So this, for me, is a welcome break, a return to my roots, and an opportunity to learn from a great team and apply my skills to something I think has big meaning in the world — and to do it as an engineer again, which is one of the things that gives me energy. Who knows where it leads a few years down the road, but for today, I'm excited to be there working on this.

Utsav [09:39]: It sounds like a lot of fun. My first question is for ICs like me who've never stepped into a management role. We can understand, or at least empathize with, what a line manager does — but what is a director, or an SVP, supposed to do? Why are their calendars full of meetings [10:00] every day? How are they supposed to drive impact — is it just by meeting people? I know that's how they drive impact, but what does success look like?

Bharat [10:08]: Yes. Well, from the fundamentals: say you've got a big company that spends $100 million a year on product, engineering, and design — let's simplify and say you're spending $100 million on engineering. You'd want that $100 million to go into the pockets of engineers, and you'd want those engineers to build great things. But that's something like 400 engineers — a lot of engineers. Those engineers need guidance, career coaching, a little hand-holding; they're at different levels of experience, and they need to be aligned and focused. So they all need line managers — and if every line manager has about seven reports, that's 12% or 13% of your team that automatically goes into management.
And then you still have 50 to 60 managers, and they need managers, so now, again, rule of seven; and then you're still going to have directors, and they need managers, the VPs, and so on. At every level, the line manager has to understand how to marshal engineering talent and get them focused on working the right way, so that the majority of their work leads to a result. But go up a level: you have a director whose primary tool is a series of line managers, and the director's role is to get those managers aligned. You've got these groups that need to work together, and for the most part they're all heads-down in their own area, so someone's got to work on that broader alignment. So now you've got a director who has seven line managers, those line managers each have seven-ish reports, so you've got about 60 or 70 people reporting up to this director; and then you have a VP who has five or six of these directors, each of whom is now wielding a pretty large amount of work product and roadmap influence. You've got to get those directors aligned, and the farther you go up, the more different the problem becomes. Your director is making sure that you're operating efficiently and reasonably aligned (as they used to say at Sun Microsystems: tightly aligned, loosely coupled), so that everyone's headed in the same direction but not constantly jostling with each other because they're crammed in there. That's a different skill set at the director level; it requires you to be very thoughtful about how you achieve a broader set of goals with this group of people, and you're always dealing with the problems that bubble up to you, the problems that the leads under you can't handle. You handle the ones you can and pass the rest up the chain to your VP. Now your VP has something like 300 to 400 people, the toughest problems are bubbling up, and your VP is dealing largely with directors. For the VP, it's like trying to play chess with boxing gloves on: you don't have fine-grained control, you have to be much more like, hey, move this set of pieces over there, move that set of pieces over there. You can't reach past the director, past the line manager, and influence an individual; you have to think in these broad movements, and it just goes up and up. So you can see how the skill set of a line engineer is very specific to their domain; the skill set of a manager is a completely different, relational skill set around helping engineers develop, cultivating them to become increasingly better and to grow. The skill set of a director is different again: strategically getting larger organizations to work effectively together to achieve company-level goals. The director has to sit in a zone where they understand what every line engineer is doing, but also understand what the C-suite members are saying when they start talking about business goals; someone's got to bridge that gap, and directors tend to sit in that middle spot. Then you've got the VPs above, who are essentially strictly organizational leaders (it helps if they have a good technical underpinning), and they're figuring out: what organizational shifts do I need to make? What key goals do I need to set? What top-level decisions do I need to make to get the team to move effectively?
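A rough sketch of the span-of-control arithmetic Bharat walks through here. The uniform seven-reports-per-manager ratio and the 400-engineer headcount are illustrative assumptions, not numbers from any real org:

```python
# Back-of-the-envelope org shape: given N engineers and a uniform span of
# control, how many managers, directors, and VPs fall out, and what fraction
# of total headcount ends up in management?
def org_layers(engineers: int, span: int = 7) -> list[int]:
    layers = [engineers]
    level = engineers
    while level > 1:
        level = -(-level // span)  # ceiling division: leaders needed above this level
        layers.append(level)
    return layers

layers = org_layers(400)
print(layers)                      # [400, 58, 9, 2, 1]: engineers, managers, directors, ...
management = sum(layers[1:])
print(f"{management} leaders, ~{management / sum(layers):.0%} of headcount")
```

With 400 engineers and a span of seven, this lands around 15% of headcount in management, the same ballpark as the 12-13% figure above.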
The further up the chain you go, the more your job becomes being an efficient decision factory. As a line engineer, your job is to produce code quickly. As you go up, you have to make tougher and tougher decisions, and that's a very different skill set. So over time people get opportunities to move into management and apply some [15:00] of these things, but not everyone loves it, because you have to make some tough decisions. And the decisions are generally along the lines of deciding what you're not going to do. It's very easy to make a non-decision and do a little bit of everything, but that way lies mediocrity and failure. The toughest thing for a leader to do is to hold a line and say: we are not going to do this. I know it's attractive, I know it looks lucrative, but it's not what we're doing; we may come back to it later, but we are instead going to marshal and focus our energies on this set of things over here, which is small and focused, which we can get done, and which will move the needle. So as people move up that arc, they have to accept that their primary tool will change and their majority motion will change: it stops being you talking to the computer and becomes you talking to other people. And their primary set of outcomes and accountabilities will change too, because it shifts more and more toward what the business needs, versus what the technology can deliver.

Utsav [15:58]: Yes, that makes a lot of sense. And how in tune do you need to be with the technology itself? We often say that managers, directors, and VPs all have to be people with serious technical backgrounds, but they're moving further and further away from the technology. Why is it still important for them to understand the technology, or is it important at all?

Bharat [16:21]: I think it is important, because they say in the army, when you become a general and you get that star put on your lapel, it's the last time people tell you the truth. A friend of mine was telling me leadership is similar: the further you go up the leadership chain, the more levels there are between you and the engineers on the ground, and the more the messages from the engineers on the ground are filtered and interpreted for you. And they kind of have to be, because the engineer on the ground doesn't have the necessary broad context. They just see, hey, this thing is working, this thing isn't, and they may be complaining: hey, we don't have enough resources to finish this project, and this project is important. But you as a leader might know that while that project is important, it's not a business priority, and you're going to defund it and deprioritize it; you're going to make a tough decision. So you have to have some level of filtering, but the problem is that sometimes the real signal is thrown out in that filter. So as an engineering leader, it's helpful to know the base-level technology and the fundamentals, such that when a message comes up to you, you can give it the sniff test: does that sound right? For example, at Dropbox we talked about procurement for hard drives. You know the size of a hard drive, and then you ask yourself: if we're trying to store this many petabytes of data, why do we need 5x that much raw storage?
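That question is really a short chain of multiplications. Here's an illustrative sketch of the sniff test, with invented overhead factors (none of these are Dropbox's real numbers), showing the kind of multiple Bharat goes on to describe:

```python
# Sanity-check a storage procurement request: logical data times the known
# overheads should land near the quoted raw capacity.
logical_pb = 100        # petabytes of user data to store (assumed)
replication = 3.0       # assumed copies kept for redundancy
dr_factor = 1.3         # assumed extra capacity for disaster recovery
fs_overhead = 1.1       # assumed filesystem/OS overhead
headroom = 1.15         # assumed free-space headroom so disks never fill

raw_pb = logical_pb * replication * dr_factor * fs_overhead * headroom
print(f"raw capacity needed: {raw_pb:.0f} PB ({raw_pb / logical_pb:.1f}x)")
# ~4.9x here: so a 3-5x request passes the sniff test, while a 10x request
# needs some explaining.
```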
Well, that's the kind of thing where, as a senior leader, you have to start thinking: okay, how much of it is lost to the operating system, how much to sharding and redundancy, how much to disaster recovery, etcetera. What's a reasonable multiple? You have to have some sense of it in your head, so you know that 3x to 4x is reasonable but 10x is unreasonable, and the more you understand those things, the better. That doesn't mean you have to have been a hard-drive firmware developer somewhere in your path; it does mean you need to commit to understanding the fundamentals of the space that you're in. So I would say it's a bit of a false dichotomy to say that you can't be a good engineering leader unless you have been an engineer in the past. I would say that for you to be a good anything leader, you need to understand the fundamentals of the domain that you're leading. And as you go further and further up the chain, you tend to start taking on things that are related. For example, I was an engineering leader at Google, but then I took on capacity planning. Now, capacity planning is related to engineering, but it's a very different field where you have to understand supply chains and fabrication cycles and duty cycles and capitalization cycles. There were a lot of things to learn, and I found that when I took that on, I had to go hit the books and learn a huge amount of stuff to be effective. And I think that's just par for the course; that's what you have to do.

Utsav [19:22]: Okay, yes, that makes sense. Now, you mentioned that the job of a director or a VP is being an efficient decision factory. You're making tougher decisions at every level, and you can also see how that's immediately impactful: somebody's doing a project, but you decide you're not going to fund that project, you're going to fund another one because it's a higher business priority, and that makes total sense. Now, when you transition to being a senior IC, or when you are a super senior IC, how do you make a similar amount of impact? And is it even possible to make a similar amount of impact? [20:00] What changes, at least in that impact sense?

Bharat [20:05]: Yes, it's a good question. One argument is that to have a big impact, you have to do more than what you would do with your own two hands; you have to marshal a team, and that usually speaks to leadership or management. But in the technical realm, I think you can have a very large impact by setting technical direction. If you think about the role of a CTO, it is to understand the arc of the world, to understand the needs and the arc of the business, and then to figure out where those things intersect. Where are the challenges of the world? How far out do we have to go for our new technology to satisfy them? There's an intersection point out in the future where you say: hey, if we proceed along this trajectory, with this many people building this technology, then by this time down the road we solve that world problem, it becomes monetizable, it floats the business, we're in a good place. And the role of the CTO, among other things, is to find that intersection and figure out how to bring it in sooner. How can we do it six months sooner, or a year sooner?
In many business contexts, being able to deliver it six months sooner is all the difference between the successful business that has carved out a real market niche and the business in second place that just can't quite keep up. So you need to understand how all those pieces fit together, and very often that requires you to be deeply invested in technical strategy and in setting technical direction. In a large company (Coinbase is large; Google is massive) there's a lot of opportunity for impact by being a technical leader who understands how all these pieces fit together and can start setting technical strategy that accelerates the business toward where it needs to go. And especially for companies at Coinbase's scale, I think there's still a lot of opportunity to focus on this. Once you get to the scale of, say, a Google or an Amazon or Microsoft or Facebook, there are a lot of things that are so big and so deeply entrenched that it's quite difficult to make big technical changes. So one of the things that excited me about Coinbase is that there's still a lot of time and room and space to make that happen.

Utsav [22:31]: Yes, I think that makes sense. And as somebody who's setting technical direction, how much are you thinking about the business priorities? You probably are, but how much do you weigh that against other things, like organizational alignment?

Bharat [22:47]: It's an interesting question, because I'm going back to this role after a long hiatus. So one answer would be: hey, ask me in a month or two, when I've had a chance to find my footing. I think what will likely happen is that I will partner with the organizational leadership. I've got the time and the boots on the ground to go understand some of the technical challenges; I do not have the double-edged sword of organizational leadership, which would simultaneously give me the authority to direct some of the work myself but also come with the drag of having to maintain that organization and lead it. So my sense of how this will work out is that I'll partner with Manish and keep him informed of what I'm seeing, and I'll start making suggestions and proposals and shifts and adjustments to what we're doing, because he has the ability to shift the organization in certain directions. But it's a little too soon to tell.

Utsav [24:00]: That makes sense. Is it generally true, as it seems to me, that as a senior IC you need some kind of air cover and backing from senior management to drive maximum impact? You can have all the ideas in the world, but unless they get funded, or they're discussed, and there's somebody saying, you know what, we should fund some of these ideas, there's not that much impact you can make.

Bharat [24:29]: It is always beneficial to be able to bring leadership along with any decision that you make, and I would say it's not just leadership, it's the whole organization. No engineer is in a vacuum, no engineer has that kind of authority, and frankly, it takes time to build trust and credibility. I'm stepping into a space where there are a large number of very smart people who have worked [25:00] hard and understood the complexities of the problem space. Most systems are sophisticated solutions to complex problems, and it's easy to mistake that sophistication for unnecessary complexity if you don't understand the problem.
So step one in all of these cases is to go understand the problem: get a sense for what they're solving and why, go understand the solution, and then try to understand where it is correct in terms of its sophistication, where it could be adjusted, where it's maybe a little unnecessarily complex, or where it could be simplified. Or maybe we can re-evaluate the invariants of the system: what was it designed for? What is it solving? Are those things still true? Are there ways to make shifts? In all cases, that's a dialogue: you understand and frame the problem, get people into a room, and have them understand it. My sense is that generally the right solutions emerge; you converge on them. The things that hold you back are essentially when it's going to be prohibitively difficult, or when people have invested in certain areas and it feels like too big a correction; there's always a level of change management to get from the trajectory you're on to a different trajectory. I'm not so worried about that. One of the reasons I've enjoyed the process of getting to know Coinbase is that the folks I've talked to (and I talked extensively to folks on the team before I joined) have been very open to this, very humble, and very excited about the prospect of making some changes, looking to bring in some DNA from the outside to go focus on some of the challenges they have. And frankly, I'm excited to have people to collaborate with on this. So I'm not worried that I'll show up, have a bunch of great ideas, and no one will listen. I am honestly more concerned about getting the time to invest in learning and understanding the great work they've done already, so that when I get to the point where I'm thinking about changes to make, I understand the problem and the solution.

Utsav [27:26]: Okay, I think that makes sense. And when you step into such a role, is there a particular team that you join? Are you reporting to the same line manager that other engineers are reporting to? How does that work, and what do you think makes sense?

Bharat [27:42]: So, I asked for a bit of an exception to the normal case, where you would join a team and work on its problem. And the reason I asked for this exception is that I felt one of the advantages of bringing in someone as a more senior technical IC is to give them the time and space to make their own assessment. It's a privilege, and I appreciate the consideration Coinbase offered in letting me have a little flexibility to make my assessment; then I'll be in a position to have a real discussion with the leadership team about the right thing to work on. I've found Manish and the rest of the team to be very thoughtful about creating that space. And to be honest, when you're stepping into a company that's been doing work like this for eight or nine years, it's difficult to figure this stuff out quickly. If people way better than me (and there are people way better than me) could come in and figure it all out in a couple of months, then the team hasn't gotten nearly far enough. This is a large, complex system; it's a $50 billion company, one of the top crypto exchanges in the world, in an emerging space, that's been doing a huge amount of pioneering and is moving aggressively.
There's a tremendous amount of sophistication in the solutions they have; it just takes time. So my sense is that if I did a more conventional role out of the gate, I would just be picking one thing now and going and focusing on it. But I asked for their indulgence to let me take a step back, look at it from altitude, and spend the first quarter or two doing that, so that I can be more productive down the road. I view this as a long-term engagement, and so does Coinbase, so it makes sense to carve out some time up front; it becomes a very small percentage of the time over which I, and hopefully many other people who follow the same path, can add value.

Utsav [29:50]: That makes sense. Let's say that I'm a head of engineering and I'm thinking: okay, I have a bunch of line managers; let's say it's not a large company, like [30:00] 30 or 40 engineers. At what point should I be thinking about bringing in, or even having, technical leaders or senior technical ICs? Does it just not make sense until you have an extremely complex application? Does it not make sense until you have 200 engineers? What's the framework you should use to think about this?

Bharat [30:23]: What is the value of senior technical engineers? Well, they're experienced, so they have seen what works and what doesn't; they have an understanding of the fundamentals from a range of prior situations. Very often, one of the real roles of technical leadership is deciding what not to do, not what to do: pruning bad paths from the decision tree, which saves you a ton of time. So I would say that if you're the head of engineering, and the work you need to do and its operating specifications are well defined, you may just need people to go execute against your roadmap, which is, by the way, hard enough to find in today's industry. And that's not to say you shouldn't get A talent; you can always use A talent, and I think you should always be aiming for A talent. But it matters more in a situation where you're trying to grow into an unknown area, you're trying to go up by an order of magnitude, you're trying to move into a space where you don't necessarily have the skills in-house. Most companies, if they 10x a couple of times, rapidly get into a space where they simply don't have the engineering talent in-house to do this. Because when you make a product 10x larger (at Google we always used to say: when you make a problem 10x larger, it ceases to be the same problem) it becomes a very different problem; it becomes about the scale you're moving into. So if you're a company that's trying to grow rapidly, or push into new domains, or understand things differently, it's helpful to have a deep bench of senior technical talent who can go off and investigate building complex systems. That's not the kind of thing you would conventionally put on the roadmap; you need people who have the time and space to explore and come up with new and innovative ideas. It also depends: are you a product company? A platform company? A technology company? For example, Google fundamentally was a technology company.
And so they invested in technologists to go explore all kinds of spaces, and then they were good at recognizing: ah, this will be useful when we 10x again, and this will be useful when we 10x again; let's keep these things, let's double down on our investment in some of these spaces. As a result, Google was always well prepared: when we grew, our systems and our technology could keep up. Many companies, when they grow at that scale, find it hard for their systems and technology to keep up, because they don't have that deep bench of people who've already been ranging out in front, ferreting out new things to do. So when their growth kicks in, they wind up in a scramble to keep up with it. Dropbox was great at this: Dropbox grew fast in early 2010, 2011, 2012, and they had a world-class technology team that was able to stay ahead of that growth curve, which is not easy, and they had been investing in a lot of people to build large-scale systems ahead of the demand. The problem is, if you build those large-scale systems ahead of the demand and the demand doesn't show up, then you've wasted that investment; and if you don't build ahead of that curve and the curve does show up, then you're going to fall over. So it's a tricky calculus.

Utsav [34:02]: I think the whole idea that you have to consider how much ambiguity is in your roadmap, and that that informs the decision, makes a ton of sense to me. Could you share publicly the kind of things you'd be thinking of at Coinbase? Maybe not specific projects, but just the kind of work you're thinking about, like infrastructure work, or exploring new things, or do you think that's off the table?

Bharat [34:33]: Well, I want to be respectful to Coinbase and not divulge any of the cool things they're doing behind the scenes. So let me just say this: I feel comfortable with a pretty broad charter, and I think all of those things are on the table as things we could talk about. They're all areas where Coinbase, I think, is growing rapidly and innovating, so I'm not trying to narrow [35:00] my options down right now. But let's talk in a couple of months, after I've gotten my boots on the ground, and then we can see what I can share with you.

Utsav [35:10]: Cool. What does it mean to be the CTO of a small company or startup, like when you were at AltSchool? That's such a different role from being a super senior IC at Google. What mindset shift did you have?

Bharat [35:26]: Well, it's tricky, because at AltSchool we were solving a humanity-scale, soft problem, and I think it's deceptive to believe you can solve those types of problems with technology. Technology can help, but technology is a tool, and technology has a wide range. Banging flint and steel together in the woods to light a fire is technology; rocket fuel and fire to put a rocket on Mars is also technology. But if you're lost in the woods, you don't want a Falcon 9, you want flint and tinder and steel; you want the right technology for the job. So part of the role at AltSchool was asking: what is the right technology for the problem? A lot of what I did at AltSchool was understanding the problem well enough to get to a technology solution that made sense for where we needed to be over a certain period. And I would say that, unlike many Silicon Valley companies, AltSchool was technology second, or maybe even technology third, after the pedagogical model.
And after the curriculum design, the content design, and the operational model of running schools. Then you have the technology: scaffolding, structure, and utility that helps make things easier. So for a CTO of a small company, I do think you want to be thinking very carefully: what is the problem to be solved? Where is the correct place for technology to be injected, such that technology supports and drives the mission and the business? And then, how do I marshal the right team to get it done efficiently? Very often that means, like a buddy of mine who just joined a very small company as the CTO, you're probably going to start by looking at the problem and assessing its scope today and where it's going to be. Let's say you have 10,000 users today, and you think that 18 months from now you're going to have 100,000 users, and three years from now you're going to have 300,000 users. Well, you don't need to build a system anytime soon that can support a million users. So don't go chasing massive scale efficiency; start thinking about the operational characteristics of your users. Are you high transaction volume? Is it okay if some data gets lost? Some systems care a huge amount about integrity; some care mostly about response time. You start thinking: what are the different decision points I can make? And then you can start driving decisions like: okay, are we a product or a platform? I was talking to a CTO recently, and my big question for him was: if you cannot answer whether you're a product or a platform, you don't know where to marshal your resources. So very often what the early-stage CTO has to do is figure out: okay, what are we? What technology, and what characteristics, do we need to have 18 to 36 months out? And how do I build out an architecture that is neither too much nor too little, that achieves the goals of the business in a way that makes sense? Some businesses need to be high volume with very fast response times; some are pretty low volume and can tolerate reasonable risk and medium response times. Some can be heavier, some lighter. Some need to be the kind of system where you can iterate quickly and push a feature every couple of days; and some are the types of systems where you don't want to push features every couple of days, because your users want stability. They're like utilities: users expect them not to change all that often, and they prioritize stability over the rapid injection of features. So very often the CTO needs to decide: once I know what those characteristics are, then I can make architectural decisions, and then I can start thinking about what frameworks or technologies I need. Do I want to be native on mobile, or cross-platform on mobile? All of these decisions, I think, an early-stage CTO should be thinking about in the vein of: where are we today, and where do we need to be before our next raise? Because very often what you're trying to do is get to the point where you raise the next round of money. Now, if you're the kind of business where you [40:00] don't necessarily need to raise a bunch of money, and you're looking at a longer-term arc, and you're thinking: okay, we are going to run this, and we are very confident that we're going to get to 100 million users, that's a large number.
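A back-of-the-envelope version of the stage-based sizing Bharat describes: take the user targets at each stage, apply assumed activity and traffic rates, and see what the serving stack actually has to handle. Every rate here is a made-up placeholder you'd swap for your own product's numbers:

```python
# Per-stage load estimate: users -> daily actives -> peak requests per second.
STAGES = {"today": 10_000, "18 months": 100_000, "36 months": 300_000}
ACTIVE_RATE = 0.2            # assumed fraction of users active on a given day
REQS_PER_ACTIVE = 50         # assumed requests per active user per day
PEAK_FACTOR = 3              # assumed peak-to-average traffic ratio

for stage, users in STAGES.items():
    actives = users * ACTIVE_RATE
    peak_rps = actives * REQS_PER_ACTIVE / 86_400 * PEAK_FACTOR
    print(f"{stage:>10}: {actives:>7,.0f} daily actives, ~{peak_rps:,.0f} peak req/s")
# Even the 36-month stage lands around 100 requests per second at peak under
# these assumptions: no reason to architect for a million users yet.
```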
It's not billion-plus users like at Google scale, but if you're going to get to 100 million users at a certain transaction volume, you've got to figure out how many actives you're going to have, how big your serving stack needs to be, etcetera. So very often the early-stage CTO has to think about it in terms of stages and tiers, figuring out at every stage what you need and then building a roadmap to get there: we're going to make these architectural decisions in the early stage, and then when we start pushing past some of these thresholds (our users get more active, or we get more users, or we're trying to monetize), we're going to make a series of different decisions, and we have some decision points. And that's hard; it requires discipline, it requires hygiene, it requires a lot of careful thought on the part of an early-stage CTO. That CTO is also probably coding, and probably working very closely with the engineers, and understands all the engineering challenges and solutions; and that CTO is also working closely with the CEO and their other peers and business partners to make sure they understand where the business is heading and what's coming. That way, the CEO is not off selling products that engineering can't build, and engineering is not building the wrong things because they didn't understand the CEO's vision for the company. So you sit in that spot, and it's tough; you spend a lot of time as an early-stage CTO running around dealing with a bunch of fires, but you wind up being probably the only person who can translate what's happening in the business down to the line engineers.

Utsav [42:04]: That makes sense. One of the distinctions you spoke about was being a product versus a platform. Why is that important for a CTO, or even for an engineer joining the company, to know?

Bharat [42:17]: So imagine you're the CEO, and you've got this app that you built, and you're out selling it. And your customer says: this is awesome, we would like you to build a version of that app for this slightly different business. The CEO thinks: well, great, we've got this one app, we have a team that built it, how hard can it be? I'll go make a deal to sell it slightly differently to this adjacent business. Now you go back to the team, and the team has two choices. If the team has built the app like a product, they don't have an easy way to do this: the app is very tightly tuned, they've made a whole bunch of assumptions about the domain they're building for, and those assumptions are valuable. By making assumptions, you simplify your code: it doesn't need a lot of abstractions, it doesn't have to be super flexible; it does one thing, and it does that thing well. But now if you want that code to do something different, it's quite hard, because you have to go find all the places where you made those time-saving assumptions and generalize them (it could be this or that), and then you start getting into really weird edge cases. So it's difficult to take a specific product and transfer its value into another specific product; you pretty much have to rebuild the new product from scratch. But if you build a platform, then most of the value is in the platform, and the app itself is merely a conduit to the value in the platform, via, say, an API.
Now, the downside to this is that if the CEO comes back and wants to add a feature to the product, that can be harder, because you've got to add the feature to the platform, then add it to the product, and plumb between the app and the platform to get it to work. And it's probably not exactly what the CEO wants, because the platform has to be generic enough to be flexible, and the app is doing a translation there. So you lose a little bit of that hyper-specific, amazing app specificity. But if the CEO then makes a sale into an adjacent space, you're writing a new app which sits on top of the same platform, and you get to carry much of the platform value over into the new app. If things like identity, storage, networking, configuration, and basic UI modules are all part of your platform, then reassembling those Lego pieces into a whole new beast is not so difficult; you can do it quickly. It won't be as hyper-specific, but it will be a lot more stable and reliable, and you get a lot of reuse out of your engineering team. So you're [45:00] always making some of these tradeoffs, and it's important to know which ones you're making at any given time, because if your CEO is about to go make a sale, you want them to understand how difficult it will be for your team to achieve the result. If you built a product, and the CEO is now excitedly selling that product in a way it was not designed to support, as the CTO you want to be able to give guidance: yes, we can do that, and if it's good for the business we'll do it, but because of certain assumptions we made along the way, it will take a non-intuitively long time to make it happen. And that's where the CEO needs to understand: okay, but if I shifted a little and sold something slightly different, if instead of selling to this customer I sold to that customer and got similar value but much lower-hanging fruit in terms of cost, then I get a huge benefit that way. So very often you as the CTO become the translation layer, and that's why it's important to ask these questions up front. If you're a CTO, I would be talking to the CEO: what kind of sales model do you want? What things are we willing to trade off? Explaining to your partner, the CEO, the difference between a platform and a product is important, so that you can co-create the solution together. You can say: okay, yes, we're going to be a platform, because we anticipate having to be flexible about our domain, and we're willing to trade off some hyper-specificity in our app. Or you can say: no, we're going to be a product, because we don't believe we're going to shift domains, we want the hyper-specificity, and we want to make a bunch of simplifying assumptions up front. But these are all things you need to negotiate with your peers, so that you don't wind up making bad assumptions, going down a path, and then finding that the business needs require doing something you ruled out based on those early assumptions, leaving you with some very heavy lifting to get back on track.

Utsav [47:12]: That makes sense. And in fact, I feel like it can be applied to things within the app: parts of the app could be more platform-like, so that you can easily make shifts there.

Bharat [47:21]: Right.
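A toy sketch of the product-versus-platform distinction from this exchange. The service and app names are invented for illustration; the point is that the apps stay thin conduits over shared platform pieces (identity, storage), so a sale into an adjacent business mostly means writing one more thin app:

```python
# Platform: generic, reusable services. Apps are thin conduits over them.
class Platform:
    """The reusable Lego pieces: identity, storage, config, and so on."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def authenticate(self, token: str) -> str:
        # Generic identity check; every app reuses this instead of hard-coding its own.
        if not token:
            raise PermissionError("invalid token")
        return f"user-{token}"

    def store(self, key: str, value: str) -> None:
        self._data[key] = value


class InvoicingApp:
    """The original product: a thin app over the platform."""

    def __init__(self, platform: Platform) -> None:
        self.platform = platform

    def create_invoice(self, token: str, amount: str) -> None:
        user = self.platform.authenticate(token)
        self.platform.store(f"invoice:{user}", amount)


class PayrollApp:
    """The adjacent-business sale: a second thin app, reusing the same platform."""

    def __init__(self, platform: Platform) -> None:
        self.platform = platform

    def run_payroll(self, token: str, amount: str) -> None:
        user = self.platform.authenticate(token)
        self.platform.store(f"payroll:{user}", amount)


platform = Platform()
InvoicingApp(platform).create_invoice("abc", "100.00")
PayrollApp(platform).run_payroll("abc", "5000.00")  # little new code for the new sale
```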
And by the way, it's a very common issue among engineers, when they don't know what they're going to be asked to do, to build more generalized systems. Say you're building your first app and you have to build an authentication system. One approach is to do something very, very specific: hard-code everything, because this is the only time you're ever going to do it. But very often people think: well, I might get called on to reuse this in some other way, so I will bake in an abstraction layer now. And when you bake in that abstraction layer, you are future-proofing the system, but it comes at the expense of a bunch of additional work every time you use it. So you potentially win big under certain circumstances, but you pay a tax every day that you use it. I am in favor of: don't pay the tax, do the dumbest thing that can work, let the business needs evolve, and then, once you truly understand what the business needs, go re-architect as necessary. That way you don't waste time up front, but you know that if certain business decisions are made, you have a set of work you're going to have to do. And that's not the worst thing in the world, because you can plan for it. You can make a series of decisions down the road that say: okay, if and when we cross this threshold and make the sale, we need to hire a team to do a large-scale refactoring, pull our identity and authentication system out under separate cover, put it behind an API, rebuild the thing from the top down, and, by the way, clean up a lot of stuff in the process. That's a reasonable thing to do, and it's better to do it in a more top-down way than to back your engineers into a corner where they're building defensively all the time in every area.

Utsav [49:12]: And maybe as a final question to wrap up: let's say you're an IC who's thinking for the first time, should I be transitioning into management? I'm not so sure whether I'll actually like all the work that's in management, but I feel like it would move my career up and let me have a larger impact. As someone who has made this transition, to and from, multiple times, what is the first thing you would recommend? Is it just: explore it, try it, and if you don't like it, come back? Is there anything else you would tell people to think about?

Bharat [49:44]: You know, it's funny, I've had this conversation so many times with different people, and I think a lot of people go into management for different reasons. So whenever someone comes to me and says, hey, I want to be a manager, my first question is always: why? What motivates [50:00] you to do this? Very few people come to me and say, I love management and that's just what I want to do. I'm sure there are a lot of people who feel that way, but I don't hear it very often. Mostly what I get is people saying: look, I want my career to progress, and all the people who I think are succeeding are in management of some kind; therefore I must be in management of some kind, and once I'm in management, I will be succeeding. And the tricky thing is that engineering doesn't flow into management; it's not the next stage in your career. Saying "work hard as an engineer, and then you become a manager" is like saying "work hard as a painter, and then you become a ballerina"; they're completely unrelated in so many ways.
They're different motions, different skill sets; they require you to care about different things. Some people like it, and some people don't, and the only way to know is to get some exposure to it. There are a lot of ways to get that exposure. One is to talk to a lot of managers: what do you like? What do you not like? What is your day like? Walk through the calendar. A lot of times I dissuade people from management merely by showing them my calendar. Let's look at my calendar, let's see what I'm doing today: okay, today I have eight hours of meetings (by the way, right now they're all on Zoom), and that might be 10 to 14 meetings, a lot of 30-minute one-on-ones, some breaks. Each one has a certain amount of context for the individual: here's what I'm trying to achieve in each of those meetings; is it relational, is it strategic, is it transactional? What's going on in that meeting, and what am I trying to achieve? And then you look at it day over day, week over week, month over month, quarter over quarter, and you say: okay, here's what it's like to be a manager, here's what I achieved, here are the goals and the outcomes, and here's what I do to get there. Some people look at this and say: I don't think I would enjoy that. And some people look at it and say: I don't think I'd enjoy it, but I have to try it to know, because it feels like the right way to progress. It's a very personal choice, and a lot of people don't know until they try it. The challenge is that a large number of organizations do kind of glorify their management team, so management seems like a step up, and very often people don't feel like they can go from a management role to a non-management role without leaving the company. So one of the things I would say is not so much for the people who are leaping; it's for the company: make it clear that management is not the route to success. Doing what you do well and having an impact is the route to success. You can switch between a management track and a non-management track, but that's not the relevant piece. The relevant piece is taking what you're great at and getting to do that as much as you can in a place where it has impact, so that people feel comfortable trying management without feeling like they have to leave the company if they don't like it. That makes it easier, I think, for people to find out whether they like it and whether they're good at it. And the other thing I would tell people: when you realize it's not for you, don't keep doing it. Go talk to your leadership team, have a candid conversation, and say: okay, I tried this, and now I need a break for a while. A good leadership team will recognize your value proposition and will find a way to make that work, so that you're not tied up in the idea that, well, I got promoted to being a manager, and if I don't keep managing it's going to be a demotion, even though I don't like it, I don't want to keep doing it, and I'm doing it poorly. I think that's the trap I see a lot of people fall into.

Utsav [54:02]: Yes, I think that makes sense. It should be a lateral move, and it should be super easy to transition back. That also keeps the organization from ending up with bad management from people who aren't motivated to do it.
Bharat [54:14]: You know what they say: people leave good companies because of bad management, and stay a lot longer than they should because of good management. People's relationships with their managers are paramount; it's important. And anyone who goes into management should take a lot of management training. There are just a lot of things that are not intuitive, and you don't want to learn them on the job; you should be trained in them. It's important.

Utsav [54:43]: Well, thank you so much for being a guest. I feel like I've again learned a lot.

Bharat [54:48]: Thanks, buddy. It was fun; I enjoyed it.

Utsav [54:50]: Yes, and I'm excited to hear about your role in a few months.

Bharat [54:56]: I'm excited to dive in; I think Coinbase is a great company. So [55:00] I'd be happy to give you an update once I can answer some of those questions.

Utsav [55:06]: And congrats again.

Bharat [55:09]: Alright. Thank you.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Sugu Sougoumarane is the CTO and Co-Founder of PlanetScale, a database-as-a-service platform company. Previously, he worked at PayPal and YouTube on databases and other scalability problems, and he's one of the creators of Vitess, a graduated CNCF database project that served as YouTube's metadata storage for several years.

Apple Podcasts | Spotify | Google Podcasts

We discuss his early days at PayPal and YouTube, their initial scalability problems, the technology that helped them scale out their products, the motivation behind Vitess, its evolution into PlanetScale, and some of the innovative ideas coming out of PlanetScale today.

Highlights

5:00 - Being interviewed, and hired, by Elon Musk at X.com. Working with some of the PayPal mafia like Max Levchin and Reid Hoffman.

9:00 - Solving PayPal's unbalanced books via a better Activity Report.

15:00 - PayPal's Oracle database and the initial idea of sharding.

20:00 - Early YouTube architecture and culture, and the consequences of explosive growth.

24:00 - Sharding YouTube's database.

32:00 - The story behind Vitess. It all started with a spreadsheet.

40:00 - How a user with 250,000 auto-generated videos caused YouTube to go down. How Vitess fixed that, and the implications of a SQL proxy.

45:00 - The motivation behind keeping Vitess open source, and interest from other companies dealing with similar problems. The motivation behind going from Vitess to PlanetScale.

53:00 - How PlanetScale can improve on the traditional relational database developer experience. How NoSQL was never actually about getting rid of SQL and was more about skipping traditionally rigid schemas. MongoDB's support for SQL, and PlanetScale's approach to solving schema changes.

58:00 - Technical innovations coming out of PlanetScale.

1:05:00 - Running databases in containers, and the implications.

Transcript

Utsav Shah: [00:15] Hey, welcome to another episode of the Software at Scale podcast. Before we get started, I would just like to ask a quick favor from any listeners who have been enjoying the show: please leave a review on Apple Podcasts. That will help me out a lot. Thank you. Joining me today is Sugu, the CTO and Co-Founder of PlanetScale, a managed database-as-a-service platform company. Previously he worked at YouTube for a really long time, from 2006 to 2018, on scalability and other things, including Vitess, an open source CNCF project, and before that he was at PayPal. Welcome.

Sugu Sougoumarane: [00:53] Thank you. Glad to be here.

Utsav: [00:56] Could you maybe tell me a little bit about when you got interested in databases and MySQL? Because it seems, at least from my understanding, that you've been interested in distributed systems and MySQL databases for a really long time.

Sugu: [01:11] Yeah. I would say the first serious involvement with databases started when I joined Informix in 1993. That was the time when there were three huge database companies: Informix versus Sybase versus Oracle. I guess eventually Oracle won the war, but that was the first time I came in touch with databases. I specifically did not work with the engine itself; I worked on a development tool called 4GL, which was popular in those days but isn't around anymore. So that was actually my first introduction to databases.

Utsav: [02:08] Okay, and I guess you took this skill set, or this interest, with you when you moved from PayPal to YouTube. So, what were you doing at PayPal initially, and what made you decide you wanted to work at YouTube?
Sugu: [02:25] So, PayPal. I guess when I moved from Informix to PayPal, it was mainly because this was around 2000, the time when the internet boom was happening, and it was clearly obvious that Informix had fallen behind; they were still trying to make client-server work, and that was pretty much a dying technology. That's when I decided to make a change. It was somewhat of a career change: you realize that the technology you're working on is not going to be around much longer. I don't know if other people saw it then, but it was pretty obvious to me. So that's when I decided to start back from the beginning, because you'd be surprised: in '99, not many people knew how a website worked. Now it's probably common knowledge, but I had no idea how a website worked, [03:36 inaudible] how requests come in, which servers do what, those things. It was all unknown science to me. For me, it was like starting a new career. Even within software engineering, each of these things is like a different career; there are choices. It is as if I stopped being software engineer A and now I'm going to become software engineer B. The only skill I could carry over was my ability to write programs; beyond that, I knew nothing, and I had to learn all those things from scratch again at PayPal. By the way, I did not join PayPal directly. I actually joined X.com, which was founded by Elon. So Elon actually hired me, and then later X.com and PayPal merged, and that's how we became PayPal.

Utsav: [04:41] I think if you can tell us a little bit about Elon Musk, this podcast will just shoot up in popularity, because anything to do with Elon Musk does now.

Sugu: [04:50] Alright. There is this story, which I think I've only told in one place, that is not really commonly known: I actually failed the interview. I went for the interview, and Elon thought I was not worthy; they decided to pass on me. At that time I didn't know who he was; this was the year 2000. I mean, he had just sold his previous company, but there were many other people who had done that. But when I found out what he was building, I got so excited that after I heard he had passed on me, I sent him an email saying why he should hire me. And somehow that email convinced him to bring me back for a second round. And then I got hired, and the rest is history.

Utsav: [05:55] So interesting. What is it like working with him? Did you work closely with him? How big was X.com at that time?

Sugu: [06:02] It was very small. I think we were about 10 to 15 engineers. I think the total was 50, but most of the other people were in customer support, because it was a bank, right? So there was a lot of customer support. And it's surprising: many of the people I worked with, not just Elon, are now celebrities, like Jeremy Stoppelman, who was sitting right next to me. (Jeremy Stoppelman is the CEO of Yelp, in case you want to look him up.) I remember I went there and he was the guy next to me, so I'd ask, "Hey, do you know where the bathroom is?" [06:53] "Where do you get coffee?", and sometimes he'd say, "Yeah, it's there", and sometimes he'd say, "I don't know". And finally I asked him, "So how long have you been here?" "Oh, I joined yesterday".
I literally joined the day after Jeremy; that's when I got hired. Yeah, those were good times. And Elon himself, by the way: SpaceX, he used to talk about that [07:21] as his dream, going into space. He's really inspiring when he talks about it; he starts talking and everybody just shuts up and listens. He used to talk about being broke; there was a time when he was broke, when he said he didn't have money for food. Did you know that Elon was broke once?

Utsav: [07:49] I feel like I've read about it or seen some YouTube video; now there are so many YouTube videos where they take quotes from Elon and turn them into inspirational stuff. I think I've seen some of those. That's so fascinating. And yeah, that clearly became the PayPal mafia, with so many people starting so many successful companies. Did you work closely with any of the others, like Jeremy Stoppelman?

Sugu: [08:15] Yeah, there were lots of them. I actually worked with Max a lot more, right after the merger. So Max, definitely; Reid Hoffman, although I don't think I interacted much with Reid Hoffman; there's Roelof, for sure; and Steve Chen, obviously, which is how I ended up at YouTube. My memory fails me, but if you look at the mafia, yeah, I have worked with all of them.

Utsav: [08:52] Did you work on the Igor program, the fraud detection stuff I've read about publicly?

Sugu: [08:58] Oh, no, I did not. I worked on some fraud systems, but not Igor specifically. I'm trying to remember who worked on it; it was probably Russell Simmons.

Utsav: [09:14] So what did you do at PayPal? I am curious.

Sugu: [09:18] So the thing is, and this is a good lesson, the struggle I was having initially at PayPal was recognition. There were a couple of things: I was an unknown there, nobody knew who I was, and I was part of the merger, right? I came from the X.com-PayPal merger, and there was a lot of back and forth about which technology to use. Eventually we decided to continue with the original PayPal technology and stop using the X.com technology. So the struggle I had was: how do I prove myself worthy? In software engineering, you have to show that you are good, and then they give you good projects. You don't get to choose your projects; you get assigned projects, right? At least, that's how the situation was then. And because I was an unknown, all the good work was going to the people who were already recognized for how good they were. So, there was this one tool called the Activity Report, which tallied the totals of all transactions on a daily basis and said: this is the total in the credit card category, the ACH category, and so on, and it reported those numbers. But the auditors came in and said: these transaction numbers this report produces, you need to tally them against your bank accounts, against the total balances that users have. The first time the auditors came, the books were a few million dollars off, you know? And they're asking: where is all this money? I don't know if you remember, but when PayPal was initially founded, the business model was to make money off the float: when we transfer money, there's a float, and we'd basically invest that money and make whatever we could off of it. So people were like, oh, it must be the float; that difference must be because of the float. And the auditors were like: no, that explanation is not sufficient.
You need to show the numbers and know that they add up and that they tally. And so, finally, we had to figure out how to make it all tally. But the problem was that PayPal was software written by engineers, and how do you create transactions? You insert a row in the transaction table, update the user's balance, and then do a commit, right? That's how you do it. But sometimes transactions go pending, in which case we don't update the balance; sometimes we complete the transaction and only update the balance after that. And guess what happens when you start doing these things: there's this thing called software bugs that creeps up. So every once in a while, after a release, the transactions wouldn't add up to the users' balances, and sometimes they wouldn't tally with what was actually in the bank accounts. Every few days, the Activity Report would produce discrepancies. And because the software was written organically, there was no easy way to find out what went wrong: you have, you know, a million transactions. How do you figure out which one went wrong, when the numbers don't add up to the totals? Every time this happened, an engineer was assigned to go and troubleshoot. It was the most hated job, because it's like finding a needle in a haystack. You'd say: ah, Activity Report conflict, and now I got assigned to it. Why is my life so bad? So that was the situation. And at one point I said: you know what, I'll own this problem. Just give me this problem; I will investigate it every time it happens, and I will also fix the software to make it easier. So I took it on, and I rewrote the entire program. Previously it was all totals; by the end, once it got fully evolved, it would say: this particular account has this discrepancy, and most likely this is the transaction that caused it. It would produce a report, a list of conflicts, with a very clear clue about where the problem was, and you'd spend five minutes and know what went wrong. It would find software bugs, because sometimes somebody forgot to update the balance; it would say: hey, you created a transaction and didn't update the balance, so here, go fix it. Eventually, because it was so solid, it kind of became foundational for PayPal. People would ask for reports about transactions, and someone would say, "How do you know this is correct?" and the answer would be, "Oh, I validated it against the Activity Report." "Okay, then it's good. I trust these numbers." That kind of thing. So, making that software and taking that pain away from all the other engineers got me accepted as one of the cool engineers.

Utsav: [15:12] So interesting. How big was the engineering group at that time?

Sugu: [15:19] On the X.com side there were about 15, and on PayPal's side there were also 15; so we were about 30 to 40 engineers.

Utsav: [15:28] And once you got accepted by these people, what were your next projects, I guess?

Sugu: [15:33] So, having gotten accepted, I became the kind of person where people would say: if you give a problem to Sugu, you can consider it solved, that kind of thing.
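A minimal sketch of the kind of per-account check the rewritten Activity Report boiled down to: recompute each user's balance from their completed transactions, and flag the accounts (and candidate transactions) that don't tally. The table shapes and field names here are hypothetical, not PayPal's actual schema:

```python
# Flag accounts whose stored balance disagrees with the sum of their
# completed transactions, and point at the likely culprit transaction.
from collections import defaultdict

transactions = [  # (user, amount, status): hypothetical ledger rows
    ("alice", 100, "complete"),
    ("alice", -30, "complete"),
    ("bob", 50, "complete"),
    ("bob", 25, "pending"),     # pending: must not be reflected in the balance
    ("carol", 40, "complete"),  # bug: carol's balance update was missed
]
stored_balances = {"alice": 70, "bob": 50, "carol": 0}

computed = defaultdict(int)
for user, amount, status in transactions:
    if status == "complete":
        computed[user] += amount

for user, stored in stored_balances.items():
    if computed[user] != stored:
        missing = computed[user] - stored
        suspects = [t for t in transactions if t[0] == user and t[1] == missing]
        print(f"{user}: stored {stored} != computed {computed[user]}, "
              f"likely culprit: {suspects or 'unknown'}")
# carol: stored 0 != computed 40, likely culprit: [('carol', 40, 'complete')]
```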
And the biggest problems later were all about scalability. So I moved into the PayPal architecture team, and I used to solve all the problems related to scalability. The scalability problem was always the Oracle database, because it was a single-instance database, and the only way we could scale it was vertically. So the idea of sharding was actually born there for us: we needed to shard this database. But it eventually succeeded only at YouTube. We didn't succeed at sharding at PayPal itself, but we kind of knew how to do it; we went through the learning process at PayPal of how to shard a database, but by then the eBay acquisition had taken place. There was a drift, and lots of things happened after that.

Utsav: [16:50] That's so interesting. And is that where you met Mike? You were discussing some of these stories [16:55].

Sugu: [16:56] Yeah, Mike was reporting to me. So there was Mike, and there was another person called Matt Rizzo; they were among the top engineers at PayPal. We still use some terms that only Mike and I would understand. We'd say, "Oh, this is APCD, right?" and then we'd laugh; nobody else would understand what that means. [17:28] APCD: ACH Process Completed Deposits, a Bash tool that he wrote. You probably know that Mike is really, really good with Bash.

Utsav: [17:39] Yeah. He knows his Bash.

Sugu: [17:42] He knows his Bash. That was kind of my fault, or my doing, because there was a time when I just told him: here's a problem, you own it, you have to solve it. And he said, "Do you think Bash is a good idea? Sounds cool." I said, "If you think it's a good idea, it's a good idea. It's up to you. It's your decision." So he went and did it, and he is still scarred by the fact that I let him: "Why did you let me? Why didn't you tell me it was a mistake?"
Utsav: [20:20] So, yeah, maybe you can talk about the early days at YouTube. What was YouTube like in 2006? How many people were there? Was this pre-acquisition, post-acquisition? I don't know the exact—

Sugu: [20:29] Just around the time of the acquisition. So actually, around the time when YouTube was founded, I went to India to open the PayPal India office. But I used to visit YouTube every time I visited the US, and I used to come pretty often. So, I had been following it closely. At some point, Steve and Matt kept hounding me, saying, 'What the hell are you doing there? This thing is catching fire. You need to come and join us right now'. So finally, I came back and joined them. But yeah, the YouTube culture was kind of a carryover from PayPal. It was very similar: lots of heroics, and each person was an impact player. Each person made a huge difference. Like, one person owned all of search, that kind of stuff. One person owned the entire browsing experience. There were only like 10 engineers. So, everybody owned a very important piece of the stack.

Utsav: [21:50] Okay, and what were some of the early problems that you ran into? Or, I guess, what were the fires that you were fixing as soon as you started?

Sugu: [21:59] So the one big thing I remember: there was this rule at YouTube that when you join, you have to commit on your first day.

Utsav: [22:11] Okay.

Sugu: [22:12] I broke that rule. I said, I'm not going to commit on the first day. But I spent time learning the code and produced a huge PR, a huge pull request, where I rewrote an entire module because it was badly written. So, the biggest problem I found, I felt at least at that time, was that because the code had grown organically and been incrementally improved, it was going to run out of steam as we added more features. So, I felt we needed to clean it up, to put a better foundation underneath so that we could add more features to it. That's basically the first work that I did. But then later, the bigger problem became scalability, because we were unsharded. There was only one database — it was called Main — and we had to do the sharding. So that was the biggest challenge we solved. We solved re-sharding in Vitess [23:22] later, and adopted that sharding method. But when we first started, it was at the app layer.

Utsav: [23:33] Okay, at what scale did you have to start thinking about sharding? How many engineers were there? How much data was going in, and what was being stored in this database? Was this just all of YouTube's metadata in this one large database?

Sugu: [23:48] Yes, all the metadata — not the videos themselves, because the videos were video files and they were distributed to CDNs, but all other metadata, like video title, likes, dislikes, everything else was stored in a MySQL database. It was running out of capacity very soon, and the hardware wouldn't have kept up. Vertical scaling wouldn't have taken us much further. And that's when we said, 'Okay, we need to shard this'. We made a very tough call, which was that we were going to break cross-shard transactional integrity. It was actually not as big a deal as we thought it was going to be, because of the nature of the app — maybe because users mostly work with their own profile. There were a few things about comments — if I post a comment to you, where do you put that comment? —
and some issues around those types of things, but otherwise — well, there was still quite a bit of rewriting, because taking an application that accesses an unsharded database and making it [25:10] is non-trivial no matter how you put it. We had to build a layer under the app and change everything to go through that layer, following certain rules. And then we changed that layer to accommodate sharding. That's basically how we solved the problem.

Utsav: [25:27] Yeah. That layer has to basically decide which shard to go to. So, were you just running out of physical disk space on that MySQL box? Was that the problem?

Sugu: [25:37] I'm trying to remember whether it was disk — it might've been CPU, actually.

Utsav: [25:42] Okay.

Sugu: [25:44] I think it was the CPU. It's either CPU or memory. It could have been disk. I'm trying to remember if it was CPU, memory, or IOPS — most likely memory, because of working-set-size issues. I could be wrong.

Utsav: [26:14] I'm just curious, do you know how many videos YouTube had at that time, when you started running, [26:20]

Sugu: [26:21] I remember celebrating. Let's see, it was either a hundred million or 1 billion, I don't know.

Utsav: [26:31] Okay, 1 billion videos, and that's when you start running out of space? [26:35] total videos — that just shows you the power of, like, MySQL and [26:40]

Sugu: [26:42] Yeah, it's amazing that I don't know the difference between a hundred million and 1 billion, you know, but something like that, yeah.

Utsav: [26:52] Sounds like [26:53] I don't even know how many videos would be on YouTube now. I can't even imagine.

Sugu: [26:59] The reason why it works is that most videos are inactive, right? You just insert a row and move on, so it's more a question of what the working set is. How many of those videos are popular? Those come and live in memory — and we had a Memcached layer in front as well — so it was more a question of how many popular videos people are trying to hit, and can all that fit in memory?

Utsav: [27:31] Okay, that makes a lot of sense. Then you can put a caching layer in front so that you don't hit the database for all of these popular videos. That's so interesting. And, given all of this, it makes sense to me that you could shard this, because videos, you can totally imagine, are relatively easy to shard, it sounds like. But what do you think about resharding? How do you solve resharding? That sounds like a very interesting problem. And maybe just to add some color to what I think resharding means: you have data on MySQL box A and you want to move some of it to MySQL box B, or split from one database to two, and two to four. And you have to somehow live-transfer data from the database. How do you solve that?

Sugu: [28:19] So, there's the principle and there's the mechanic, right? The mechanic is actually kind of straightforward, which is that when you start sharding, the first thing you do is actually pull tables out. If there are 300 tables, you'll say, you know, these 20 tables can live in a separate database. It's not really called sharding or resharding, because you're not sharding — you're splitting. So, the first thing people usually do is split a database into smaller parts, because then each one can grow on its own. The sharding aspect comes when a few sets of tables themselves become so big that they cannot fit in a database.
So that's when you shard, and when you make the sharding decision, what you do is: tables that are related, that belong together, you shard them the same way. In YouTube's case, when we did the sharding, we kept users and their videos together in the same shard. That was a conscious decision, that users and videos stay together. There is actually a sharding model that allows you to represent that as a relationship — one that Vitess now exposes; there's some theory behind it. But from a common-sense perspective, you can always reason about users and their videos being together. So that's the mechanic part. On the principle part, the way you think about sharding is that beyond a certain scale, you have to rethink your application as being fully distributed, with independent actors working by themselves, on their own. That means some features you often have to forego for the sake of scalability. In other words, crazy features like: what is the real-time value of the total number of videos in the system? I want that up to the millisecond. That type of question is unreasonable for a sharded system. You could probably do a select count star against a small database and actually answer it. But if you have tens of thousands of servers, answering that question becomes harder, right? And at that point you have to start making trade-offs: is it really important to know it up to the millisecond? What if it's up to the minute — is that good enough? Those are the kinds of trade-offs you make at a high level, and most people do, because you cannot get to that scale unless you make these trade-offs. But once you make them, sharding becomes actually very natural. For example, if the entire world is your customer, you are going to have, you know, 7 billion rows in your user table. And some people create multiple accounts, so you are going to have billions of rows. You cannot build features that put them all together. You have to build features that keep them separate. And as soon as your features follow that pattern, sharding drives the same decision. So, it more or less becomes natural.

Utsav: [32:21] Okay, that makes sense to me. And it seems like you said you had a sharding solution that predated Vitess. So maybe you can talk about the motivation behind Vitess, and maybe the year this was — like, if you have any numbers on, you know, at what point did you [32:36]

Sugu: [32:36] [32:37] I'm finding myself surprised that there are details I should know that I don't, like the number of videos we had when we were sharding, right? It was 2007 — 2007 is, actually, wow, 15, 14 years ago. So, yeah, it has been many years. 2007 was, I think, when we did our first sharding, if I remember correctly. Maybe 2008? Around 2007 or 2008, one of those years, was the first time we did sharding. But Vitess was not born because we needed to do sharding — obviously, because it was already sharded. It was born because we couldn't keep up with the number of outages the database was causing. That was actually why it was born. It was quite obvious that as the system was scaling, in spite of being sharded, there were a large number of things broken within the system that needed fixing for it to work.
And the system overall — when I say system, I mean end to end, the entire system, and that includes the engineers and their development process. So that's how big the problem was. But we didn't solve the entire problem. We said: from a database perspective, what can we do to improve things? So that was the scope of the problem. And more specifically, Mike actually took off, and I think spent some time at Dana Street Coffee in Mountain View, and came up with a spreadsheet where he listed every outage that we had had: how did we solve it, and what would be the best remedy to avoid such an outage? When we then sat down and studied that spreadsheet, it was obvious that we needed to build a new proxy layer, and that is how Vitess was born. And the whole idea is to protect the database from the developer. For example, there were at that time about 200 developers. Developers don't intentionally write [35:10], and a developer doesn't write a bad query all the time. But if a developer wrote a bad query only once a year, with 200 developers, 50 weeks a year — you do the math. How often do we see outages in the database? Almost every day. That's what it amounted to; we were seeing outages almost every day. And very often they were, quote unquote, bad queries coming from one developer — ones they would never repeat, but they had fulfilled their quota for the year, you know? So, what we did was we found common patterns that would cause outages. The big change we made was: if I wrote a bad query, I should pay the price for it, not the entire team, right? Because today, if I write a bad query, that query runs on MySQL, and it takes down the entire database. So, we wrote Vitess in such a way that if you wrote a bad query, your query would fail. In other words, we would look at: how long is the query running? If the query runs too long, we kill it. If the query is fetching too many rows, we return an error. Those kinds of defenses we added into Vitess early on, and that pretty much took it a long way. The other big feature we deployed was connection pooling, which MySQL was very bad at. Newer MySQL is slightly better; it's still not as good. So, the connection pooling feature was a lifesaver too.

Utsav: [37:00] That makes a lot of sense to me. And maybe you can tell us a little bit about why MySQL didn't have the capability of defending itself. From somebody who has no MySQL experience, it can just seem like a [inaudible 37:16].

Sugu: [37:17] It's very simple. The reason is that MySQL has, what do you call it, claimed itself to be a relational database. And a relational database is required, you know, to not fail queries. So, we at YouTube could be opinionated about which queries should work and which queries should not work. That freedom MySQL doesn't have. Every query that is given to it, it has to return the results for, right? To be qualified as a relational database, you have to fulfill that rule, and that rule was its curse.

Utsav: [38:01] Okay, but you could configure maybe like a session timer or something, or like a query timer — or is that just not good enough, basically?

Sugu: [38:09] They actually added those features much later. The newer MySQLs now do have those features, but they are all behind configuration.
They are not as usable, because you have to set it in a query parameter, you have to set it in your session — it's not like, by default, you just install MySQL, start running it, and you have this property. You have to configure it, which is another hurdle to cross. Whereas with Vitess, you install Vitess and queries will start failing, and people complain — we see the opposite problem, right? 'Oh, [38:53] why is Vitess failing it?' Because you are trying to fetch a million rows. If you want to fetch 10 million rows, you have to tell Vitess that you want 10 million rows, and then it will give them to you. But if you just send a query, it won't give you a million rows.

Utsav: [39:10] Yeah. What do you think was the one really important protection? As you said, there's connection pooling, and it also limits bad queries — how did it figure out that a query was a bad one? Maybe one heuristic is that it's returning too many rows, and it would just fail fast. What were some other key protections?

Sugu: [39:27] Oh, we added so many. The coolest one was — when engineers write code, and I'm guilty of the same thing, you think: how many videos could a user possibly upload? And we were thinking, you know, [39:51] for me to upload a video, I have to go make it, edit it, and then upload it. We thought, you know, 2,000 videos would be a big number — and look at how old YouTube was. It was two years old. How many videos can you produce in two years? So, 2,000 videos. Selecting all videos of a user: not a problem, right? So, we never put a limit on how many videos we fetched; we'd just select all videos of a user. And then there was one user who ran a bot — I think there's no other way. Well, now there are accounts that have billions of videos, but at that time, that user had 250,000 videos. And that wasn't a problem per se on its own, but it became a problem when that account got listed on YouTube's front page. Yeah, the index page, right — the landing page. Which means that every person that went to youtube.com issued a query that pulled 250,000 rows. So, one of the changes we made — that's why we added the limit clause — was: if you send a query with no limit clause, we will put a limit on it. And if that limit is exceeded, we just return you an error saying that you're trying to fetch too much data. So that was one protection. But the cooler protection was this: there are some queries that are just inherently expensive, right? Like this one — in this case, I think it might've been a select count star. If it's a select count, it doesn't fetch those 250,000 rows, but it scans those 250,000 rows. So, the feature we built was: if a query gets spammed — like, because it's coming from the front page, MySQL is receiving a very high QPS for the same query in the same period — what we do is check whether that query is already running from some other request. If there are 10 other requests for the same query, they are not sent to MySQL. We wait for the first one to return and return that same result to all those requests.

Utsav: [42:21] Okay. Would that be problematic from a data consistency perspective? Probably not. Yeah.

Sugu: [42:26] It is, actually — yes, it is, but it is no worse than eventual consistency. Right? If you are reading from a replica, no problem.
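The query consolidation Sugu describes is what's now often called request coalescing; Go's golang.org/x/sync/singleflight package captures the same pattern. A minimal sketch — the video-list query and its schema are hypothetical stand-ins, not YouTube's actual code:

```go
package queryguard

import (
	"database/sql"
	"fmt"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// fetchVideoList is a hypothetical stand-in for an expensive query.
func fetchVideoList(db *sql.DB, userID string) ([]string, error) {
	rows, err := db.Query(
		"SELECT title FROM videos WHERE user_id = ? LIMIT 10000", userID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var titles []string
	for rows.Next() {
		var t string
		if err := rows.Scan(&t); err != nil {
			return nil, err
		}
		titles = append(titles, t)
	}
	return titles, rows.Err()
}

// coalescedFetch ensures concurrent requests for the same query key
// trigger only one database round trip; the other callers block and
// share the first caller's result, just as described above.
func coalescedFetch(db *sql.DB, userID string) ([]string, error) {
	v, err, shared := group.Do("videos:"+userID, func() (interface{}, error) {
		return fetchVideoList(db, userID)
	})
	if shared {
		fmt.Println("result shared with concurrent callers")
	}
	if err != nil {
		return nil, err
	}
	return v.([]string), nil
}
```

The first caller pays the query cost; concurrent duplicates share its result, so a front-page query storm reaches MySQL once rather than once per visitor.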
Right — but if you are doing something transactional, there's another rule. The rule that we had at YouTube was: if you read a row, assume that row is already stale. Why? Because as soon as you read the row, somebody could have gone and changed it. So that staleness guarantee carried over into this as well: when you do a select, the result may be slightly outdated, because you could be reading from a previous query that is still running, or sometimes it could be going to a replica. Nobody even noticed that we had made this change — and we never had an outage related to, you know, query spam after we shipped that feature. But more importantly, the rule we followed was: if you want to make sure that a row doesn't change after you read it, you have to select for update. If you plan to update a row based on the value that you have read, you have to select for update. That's a rule that all engineers followed.

Utsav: [43:56] Yeah, it makes sense. And it's better to just be explicit about something like that.

Sugu: [44:00] Yeah, because of the MySQL MVCC model — MVCC, it's called consistent read, though the actual practical name is something else — that model basically tells you that the row MySQL serves to you is technically already obsolete.

Utsav: [44:24] Okay, that makes sense to me. And you all open-sourced Vitess — I remember hearing the reason is you never wanted to have to write another Vitess, so you decided to open source it from day one. Can you maybe tell me a little bit about the motivations behind making it a product? It's a great piece of software, which was used by YouTube, from my understanding, for a really, really long time. At what point did you decide, you know, we should actually make this available for other companies to use? And what was the transition of Vitess to PlanetScale?

Sugu: [44:58] So even before PlanetScale, even as early as 2014, 2015, we felt that we should start promoting Vitess outside of YouTube. And one of the reasons — it wasn't a very serious thing, it was kind of a secondary motivation — was that we felt that for Vitess to be credible even within YouTube, the product needed to be used by people beyond YouTube. In other words, for a company to trust a product, a lot of other people must be using it. That was the motivation that made us think we should encourage other companies to also use it, [45:46]. And Flipkart was actually the first one. I think it was 2015 or '16 that they approached us and said, hey, it looks like this will perfectly solve the problem we are trying to solve — and they were the first adopters of Vitess.

Utsav: [46:05] Okay, so that makes sense. It was open source, you felt that other companies should use it, and in 2015 it sounded like Flipkart was interested. But then what was the path from there to, you know, let's make it a product that people can buy? It's an open-source technology — how do you go from that to a product?

Sugu: [46:28] Yeah. So, I can talk about how PlanetScale came about, right?
So, the other thing that started happening over time, as Vitess evolved as a project: companies like Flipkart started relying on it, Slack started using it, Square started using it. And they were all scared of using Vitess. Why? Because this is foundational technology. Once you adopt it, it's basically a lifetime-commitment type of change. And the question in their minds was: what's the guarantee that YouTube stays interested in making sure that, you know, it will continue to work for Slack, right? A large number of companies were worried about that part: you folks' focus is videos, so how can we trust the technology? How can we bet the future of our company on this technology? What if you abandon the project — that kind of stuff. So, over time, I could sense that hesitancy growing among people. And this was one of the contributing factors to us starting PlanetScale, where we said: we are going to start a company that is going to make a commitment, you know, to stand behind this open-source project. So, when we made the announcement that we were starting the company, there was a sigh of relief throughout the Vitess community. You know, finally, I can—

Utsav: [48:07] —depend on this really important—

Sugu: [48:08] I can depend on this project. Now it is here to stay. There is a company that is going to make sure this project stays healthy.

Utsav: [48:18] That makes sense. Yeah.

Sugu: [48:21] There were other factors also, but this was definitely one of the major contributing factors. So, at the end of the day — this is generally a problem in open source. You say open source is free, but there is an economy behind it, because an engineer's time is not free. For me to spend time on open source, I have to be paid money, because, you know, I have my family to take care of; I need money to live. So, expecting engineers to contribute to open source in their free time — it can work for smaller projects, but once a project grows beyond a certain size, it cannot be a part-time contribution. It is a full-time contribution, and the engineer has to be paid money for it. So there has to be some kind of economy behind it.

Utsav: [49:32] That makes a lot of sense. And I think it also tells me that the problem that Slack and Square and these other companies were facing was just so large that they wanted to use the project despite all of these issues with 'we don't know the future of this project' — there was no other solution for them, given their existing MySQL stack, their hypergrowth, and the kinds of problems they were dealing with.

Sugu: [49:59] Yeah, that's right. And that is something I still wonder about, right: if not Vitess, what could they have used? The alternative doesn't look as good either, because the alternative would be a closed-source product. And that's even more scary, because what if the company goes out of business, right? At least in the case of Vitess, the source code is there. If PlanetScale goes away, Slack can employ engineers to take over that source code and continue to operate, so that confidence is there.
So that is one big advantage of something being open source: it gives higher confidence that, in the worst-case scenario, they have a way to move forward.

Utsav: [50:53] And when you finally decided to make a company and a product, what are some things you've learned along the way that you need to have in Vitess for more people to be able to use it? Because at first impression, it seems like there aren't millions of companies running into these scale issues. What are some interesting things you've learned along the way?

Sugu: [51:14] All I can say is I am still learning. Every time, I realize: oh, what I thought was important is not really that important. Somebody asked me: if you are agonizing so much about data integrity, why is MongoDB so popular? I mean, MongoDB is pretty solid now, as far as I can tell, but it was not solid when it became popular. They made it solid after it became popular. MongoDB, for the longest time, did not care that much about data integrity, but people still flocked to it. So, there is always a trade-off in what is important, and you have to find out what that is — you have to basically meet people where they are. In the case of Vitess, it's actually approachability and usability. That is a bigger hurdle to cross before you can adopt Vitess, and it's a much bigger hurdle than, you know, the data integrity guarantees people are looking for. So that's one lesson, for example. The one thing at Vitess is, we used to listen — we had our ears open, constantly listening to people giving us feedback about what was not working for them, and fixing those problems. It turns out that is insufficient. Who were we not listening to? We were not listening to the quiet person who tries it over the weekend, it doesn't work for them, and they quietly walk away. That's the voice we never heard. We only heard the voice of somebody who tried to use Vitess, got past the problem of not being able to use it, found a missing feature, and asked for it. So, we had been focusing on very specific features, but we completely forgot about, you know, how to make it more approachable. Those are the problems we are now solving at PlanetScale.

Utsav: [53:41] That makes a lot of sense. Yeah. Maybe you can tell us about one usability feature that you feel was extremely important that you all have built.

Sugu: [53:49] The biggest one, for example, is schema deployment. Subconsciously, every developer cringes. When you say, 'use a database', they're fine — but as soon as you say 'schema', almost every developer cringes, because: oh my God, I have to deal with that bureaucracy. Send it for approval, the DBA is going to reject it, my schema change is going to break something, some app is going to break — all those headaches I have to deal with. So, what we have done at PlanetScale is give you an experience very similar to what you have as a developer with your source code. You do the same thing with your database: you can take your database, branch it, apply schema changes, test them, and then have a workflow to merge them back into production. Very natural — it looks almost identical to how you develop your source code, right?
So, it's a very natural workflow, and it applies to the database, and you don't have to deal with, you know, that whole bureaucracy — it's all part of this nice workflow. It handles conflicts: if multiple people want to change the schema at the same time, it pipelines them the correct way, and if the changes cannot go together, we flag it for you. So that's a really, really nice feature, and it seems to be really resonating with developers.

Utsav: [55:28] Interesting. Yeah. And I think it speaks to, you know, why databases like MongoDB are popular.

Sugu: [55:33] Yeah. It's not the NoSQL part that made MongoDB win — it's the no-schema part.

Utsav: [55:43] Being able to just tweak, add one more field, not be blocked on another team, not worry about backfilling and all of that — it is a game changer. And maybe I'm a little ashamed to admit it, but I use MongoDB at this job and it's not half bad. It's pretty good.

Sugu: [56:00] Yeah. Now they are trying to bring SQL back to MongoDB, and people like it, right? So, the real problem was actually schemas — the fact that you can't just add something and move on. It's so hard to do in a database.

Utsav: [56:21] Yeah, and maybe you can tell me, today, what's the difference between Vitess open source and PlanetScale? What's the extra stuff?

Sugu: [56:29] They are very orthogonal, right? What we are building at PlanetScale is a beautiful developer experience. And what Vitess gives you is the other part that most good developer experiences miss, which is a solid, scalable database in the backend. Like, we could build this developer experience on top of stock MySQL, but people would hesitate to adopt that, because they know it won't scale. At some point, I'm going to hit a limit, and I'm going to spend 90% of my energy trying to figure out how to scale this — that's what companies experience. With us, that problem is taken away from you. You get a good developer experience, and when it comes to scaling, we have you covered.

Utsav: [57:23] Yeah, that makes a lot of sense. And does that mean that, as a developer — let's say I want to start a startup, and I have a little bit of experience with MySQL — can I just start with PlanetScale on day one? What would I get?

Sugu: [57:35] That's exactly what [inaudible]. Yeah. You basically spend almost no time configuring your database. You just go to PlanetScale, click — like, you click a button, you instantly get a database, and then you just start developing on it. So, zero barriers, which is what you want in a startup.

Utsav: [58:02] And the experience I get is just stock MySQL to begin with, except it would have all of these things like automatic limits on queries, and it would let me shard as soon as I become too big?

Sugu: [58:15] As soon as — yeah, exactly. And basically, what we try to do is make sure that, you know, we kind of guide you towards good programming practices. If you're running unbounded queries, it's going to tell you: this is not good, so add bounds to your [inaudible 58:36], that kind of stuff. So, we can be opinionated that way.

Utsav: [56:37] So I'm curious about, you know, all of the technical innovation that I see on your Twitter feed and in YouTube talks — you all have like an automatic benchmarking tool, you're optimizing protocol handling in Go.
What are some of the other key innovations you all are doing at PlanetScale?

Sugu: [58:56] So there is one — I don't know if you've seen the blog series about consensus, generalized consensus. I feel like that's a decent innovation. What we are doing is: there's Paxos and Raft. Those, in my eyes, are rigid algorithms, with rigid assumptions about how nodes should talk to each other, and those algorithms got adopted as industry best practice. But if you flip it around — what problems are they solving, right? You identify the problem, start with the problem first, and say: this is the problem I want to solve; what system would you build? I don't think Raft or Paxos would be the systems we would have built. And what problem are we trying to solve? We are trying to solve the problem of durability, right? I have a distributed system, I say 'save this data', and the system says 'I have saved your data'. And the guarantee that I want is that the system doesn't lose my data, essentially. That is the problem all these consensus systems solve. But then I come in and say: I'm in a cloud environment, I'm on AWS, I have zones, I have regions, right? I can say: for me, durability means my data is across two nodes. Or my notion of durability is that the data is in at least one other zone, or one other region. So, you specify a system such that whenever it acknowledges a write, it makes sure the data has reached those nodes or zones or regions. And then later, if there are failures, it recovers based on the fact that the data is elsewhere, right? So, if you look at it top-down, you come up with a very different approach to the problem, in which Raft and Paxos are just one way to solve it. So, what I have explained in the series is the generic approach behind solving the problem of durability, how Paxos and Raft solve it, and how we can build other systems that more accurately meet the durability requirements coming from the business.

Utsav: [1:01:39] This reminds me of semi-sync replication in MySQL.

Sugu: [1:01:44] Exactly. So, you could build a consensus system using MySQL semi-sync replication that gives you the same guarantees that Raft gives you, for example.

Utsav: [1:01:55] Okay.

Sugu: [1:01:56] The theory behind this is kind of what I've explained in the blog series. So that, I think, is one good innovation. And the benchmarking is another good one — there may be a PhD that can come out of that, or at least an MTech thesis. I don't know what he's going to do, but he's actually a student who has done an awesome job with it. Some research work is going to come out of what he's done. Yeah.

Utsav: [1:02:28] That's awesome to hear, because you don't generally hear of smaller companies having academic work come out of them, but PlanetScale is exactly the kind of place where something like that would come out.

Sugu: [1:02:40] Yeah. There's a lot more there. The other cool one is VReplication, the materialization feature, which is being able to materialize anything into anything. And it works so well within the model of the Vitess sharding scheme that it has become the foundation for a large number of migrations. It's the foundation for applying
DDLs. Like, today in PlanetScale, you say 'deploy this DDL', and we do it out of band, with no downtime and with safety — you can even revert it. So, the way I see it, a DDL is a form of migration; it's just one way of expressing a migration as a command. But there are bigger migrations, like moving a table from one database to another, or resharding a table. These are all migrations of some sort. And if the system can do them in the background, with no downtime and with reversibility, I think that is a very powerful mechanism to make available. Projects like these often take months, years. Now you can do them in Vitess with just a simple command.

Utsav: [1:04:16] Yeah, these seem like the kinds of things that are just completely unheard of or impossible in any other data system.

Sugu: [1:04:22] Yeah, because it's science fiction. This is science fiction.

Utsav: [1:04:31] And I also read that Vitess can be deployed within Kubernetes. So, you can get all of the benefits — the same way you get the benefits of Kubernetes with stateless services, where things can just move around, you can also use that to your benefit with Vitess, where your data can be moved around and stored in multiple places. Am I completely misremembering stuff, or, [cross-talking 1:04:59]?

Sugu: [1:05:00] No — I can reveal to you that at PlanetScale, everything runs in Kubernetes. The Vitess clusters are all hosted in Kubernetes, and we have a very large number of clusters. We may already have, or may soon have, you know, the largest number of keyspaces in the world — all in Kubernetes. The entire thing is in Kubernetes.

Utsav: [1:05:29] So, traditionally, people say that you shouldn't run stateful stuff in containers, and you're clearly bucking that trend. So maybe you can just talk through that a little bit.

Sugu: [1:05:47] Yeah, totally. So, the history behind this goes all the way back to how Vitess evolved at YouTube and Google. At YouTube, originally, we ran Vitess on our own dedicated hardware, but we were required to migrate it to Borg, which is Google's cluster manager. And the problem was that MySQL was not a first-class citizen in Borg, which means you couldn't use it as a service. There was one team — we actually collaborated with that team — but it was still not a first-class citizen. So, the only way we could run Vitess inside Borg was as a stateless application. So, we actually built durability — survivability of nodes going down — into Vitess, such that at any time the cloud can come and take down your server, and Vitess should handle that. That's in Vitess's DNA; we had to build it and make it part of Vitess. So later, when Kubernetes came out, we were already ready for it, because Vitess was already running in what I call a data-hostile environment. And because we ran at massive scale inside Borg, we had the confidence that it would run well. And it has shown that it does run well so far.
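A toy illustration of the durability-first framing from the consensus discussion above: acknowledge a write only once it has reached nodes in enough distinct zones. The Replica type and quorum rule here are hypothetical — a sketch of the stated requirement, not Vitess's or PlanetScale's implementation:

```go
package durability

import (
	"errors"
	"sync"
)

// Replica is a hypothetical handle to one replica of the data.
type Replica struct {
	Zone string
	Send func(data []byte) error // ships the write to the replica
}

// AckWrite returns nil only after the write has reached replicas in at
// least minZones distinct zones. The durability rule itself is the
// specification; Raft or Paxos are just one way of satisfying it.
func AckWrite(data []byte, replicas []Replica, minZones int) error {
	var mu sync.Mutex
	zones := make(map[string]bool)
	var wg sync.WaitGroup

	for _, r := range replicas {
		wg.Add(1)
		go func(r Replica) {
			defer wg.Done()
			if err := r.Send(data); err == nil {
				mu.Lock()
				zones[r.Zone] = true
				mu.Unlock()
			}
		}(r)
	}
	wg.Wait() // a real system would ack as soon as the quorum is met

	if len(zones) < minZones {
		return errors.New("write not durable: too few zones reached")
	}
	return nil
}
```

Recovery after a failure then follows from the same rule: because an acknowledged write is guaranteed to exist in another zone, a replacement node can always be rebuilt from surviving copies.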
Utsav: [1:07:34] Okay. Does that mean that the MySQLs basically live outside Kubernetes, and then you have Vitess running—

Sugu: [1:07:39] Inside the container.

Utsav: [1:07:40] Okay, interesting.

Sugu: [1:07:43] Yeah, we run MySQL inside pods. We are not the first ones to do it, by the way — JD.com, HubSpot, and another company whose name I forget have been doing it for many years.

Utsav: [1:08:03] Okay, and what are the benefits of running it inside a pod, in your view?

Sugu: [1:08:06] Ah, manageability, right? Because it's a uniform environment. You don't treat the database as, you know, a special case. Today, if you run Kubernetes, you have your Kubernetes setup where you run your application, and then you have to manage the database using different tools and technologies and methodologies. If you're running everything in Kubernetes, you have a uniform set of management tools that work the same way. You want to add something? Just push the YAMLs.

Utsav: [1:08:40] That makes sense. But then you have to manage draining and basically rehydrating nodes or pods. Do you have custom logic to set that up?

Sugu: [1:08:49] There is actually an operator that manages these things. We are trying to simplify that, you know, because Vitess does have a lot of machinery for this that we built when we were on Borg. I feel like our PlanetScale operator does a lot more than it should — it doesn't have to. But yeah, a little bit of glue code is needed, I would say.

Utsav: [1:09:18] That makes sense. Yeah. But it certainly sounds like science fiction, hearing that all of your data is running on these pods that can be moved around and changed anytime — you don't even know which machine has all of your data. It sounds like the future.

Sugu: [1:09:34] Yeah, it sounds like the future, but it is the future, because there is no other way you can manage 10,000 nodes, right? You're going from the unicorn world, where there is this one server that you have to watch carefully, to a world where there are lots and lots of servers. How do you scale that? There's a term one of our engineers used: you want O(1) manageability, which means that as you scale the nodes, you shouldn't have to scale your team.

Utsav: [1:10:12] That makes a lot of sense. Yeah. And what does your day-to-day job look like now? What is your role, and how has it changed over time?

Sugu: [1:10:21] Yeah, it keeps changing. Initially — the early days of PlanetScale — I spent like 99% of my time on Vitess. The last seven, eight months, I spent a lot of time on our internal operator, because there was one particular part that was broken that I said, okay, I'll go fix. That I'm actually winding down now. Eventually, what will happen, I think, is I'll have to spend a lot more time publishing. I have not been publishing enough — people have been asking me, 'When are you going to finish the consensus series?' I still haven't had time to do it. So I'll probably spend more time doing that, and probably also speak at conferences — I scaled that back a bit because I was too busy writing code internally. So, it keeps changing, and it will remain like that, because the way I see it is: whatever is needed — that's PlanetScale, right?
Whatever is more important — that's the problem I'm working on.

Utsav: [1:11:44] Yep. That makes sense. And in some ways, it's like going back to your role at PayPal: you're just working on the important projects.

Sugu: [1:11:52] Yeah, yeah. Keeping it afloat.

Utsav: [1:11:59] Yeah. And maybe to wrap things up a little bit, I have one question for you: have you seen the shape of your customers change over time? I'm sure initially, with Vitess, it was super large customers — but have you started seeing now, with all of these usability improvements, smaller and smaller teams and companies who just say, you know, 'we want to future-proof ourselves'?

Sugu: [1:12:18] Yeah, totally. It's night and day. Even the words they use are very different. The earlier customers talked about correctness, durability, scalability; the newer customers talk about, you know, 'this two-factor flow is not working', 'I have this Ruby program', 'I have this Rails thing'. Even the language is very different. Yeah, it does change.

Utsav: [1:12:54] Well, I think that's a great sign, though, because you're attracting a lot of people who traditionally would never have even thought about using something like Vitess or PlanetScale. So that's a great thing to hear.

Sugu: [1:13:04] Yeah. Yeah.

Utsav: [1:13:08] Anyways, well, thanks so much for being a guest. This was a lot of fun for me. I loved that trip down memory lane, and understanding how the product is supposed to work. Again, thank you so much.

Sugu: [1:13:12] Yeah. Thank you.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Tammy Butow is a Principal SRE at Gremlin, an enterprise Chaos Engineering platform that makes it easy to build more reliable applications in order to prevent outages, innovate faster, and earn customer trust. She's also the co-founder of Girl Geek Academy, an organization to encourage women to learn technology skills. She previously held IC and management roles in SRE at Dropbox and Digital Ocean.

Apple Podcasts | Spotify | Google Podcasts

In this episode, we talk about reliability engineering and Chaos Engineering. We talk about the growing trend of outages across the internet and their underlying reasons. We explore common themes in outages, like marketing events and lack of budgets/planning, the impact of such outages on businesses like online retailers, and how tools and methodologies from Chaos Engineering and SRE can help.

Highlights

01:00 - Starting as the seventh employee at Gremlin

04:00 - An analysis of recent outages and their root causes.

09:00 - A mindset shift on software reliability

14:00 - If you're suddenly in charge of the reliability of thousands of MySQL databases, what do you do? How do you measure your own success?

25:00 - Why is it important to know exactly how many nodes your service requires to run reliably?

30:00 - What attracts customers to Chaos Engineering? Do prospects get concerned when they hear "chaos" or "failure as a service"?

43:00 - Regression testing failure in CI/CD

51:00 - Trends of interest in Chaos Engineering over time.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Itiel Schwartz is the CTO and Co-Founder of Komodor, a Kubernetes troubleshooting platform. Previously, he was one of the founding engineers at Rookout.

Apple Podcasts | Spotify | Google Podcasts

We discuss two major themes in this episode: the rise of Kubernetes as a popular orchestration platform, and the need for using an integrated service to understand and debug Kubernetes deployments.

Highlights

9:30 - When should a startup consider using a more heavy-duty system like Kubernetes, vs. managed platforms like AWS Fargate? What are the advantages of using Kubernetes over these platforms?

18:00 - What are the new developments in the Kubernetes world? Why it may make sense in the future to run stateful services like databases on Kubernetes. Open Policy Agent

25:00 - The motivation behind starting a Kubernetes focused company

28:00 - What's the biggest gap that Kubernetes users face while debugging their deployments? And how does Komodor help with that?

39:00 - The surprising rise of Observability teams across smaller companies

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Tramale Turner is the Head of Engineering, Traffic at Stripe. Previously, he was a Senior Engineering Manager at F5 Networks and a Senior Manager at Nintendo.

Apple Podcasts | Spotify | Google Podcasts

This episode has an unexpectedly deep dive into security and compliance at Stripe. We discuss Stripe's philosophy and approach towards building secure systems, achieving compliance standards like PCI, and complex requirements like data locality laws.

Highlights

05:00 - Growth at Stripe

09:00 - A sampling of challenges involved in being a payments provider

11:00 - Stripe API traffic is much lower than the traditional large companies with Traffic teams like Google/Facebook/Netflix. Why does Stripe need a Traffic team/group?

16:00 - Stripe's innovative approach with an embargo for credit card numbers from most of their platform. Idempotency keys.

20:00 - Compliance automation at Stripe!

30:00 - Should the entire organization need to know and care about compliance? Or should teams provide internal platforms to abstract away compliance concerns?

36:00 - Data Governance and locality laws.

45:00 - Security's relationship with Compliance, and how Stripe thinks about security

53:00 - How to build teams that need to achieve such lofty goals?

Transcript (Best effort. Find my contact info at /about to report any errors).

Utsav Shah: Hey, Tramale, welcome to another episode of the Software at Scale podcast, and thank you for joining me. Could you tell listeners your background — and, first of all, the origin of your name, which I think is extremely interesting?

Tramale Turner: Yes, sure. So first of all, thank you for having me. My name is Tramale Turner. And the origin of my name, I think, unless my father was lying to me, is as follows. When I was born, he had this image of the Magi in his head — the three wise men, as it's colloquially known. And he thought, "Okay, three men, tri-male." He didn't like the "I", and so he changed the "I" to an "A", and I thus became Tramale.

Utsav Shah: Cool, very cool. And so a lot of your experience has been in the networking and traffic space. Right now, you're the Head of Engineering of Traffic at Stripe, and previously, you worked at F5 Networks. So what got you interested in this space? Could you tell us a little bit about your story? What got you into this space, and what do you think of it? Clearly, you like it.

Tramale Turner: Well, it's interesting. In fact, I hadn't been in specifically this functional area for the majority of my career, but it has been my focus for the last, let's say, five years. I started off as a software engineer. I was a software engineering student at the University of Pennsylvania. I left Penn after my third year, moved to Japan, and worked as a software engineer in Tokyo, working on what I think we now call digital marketing, or web development, or sometimes interactive media — it didn't really have a name at that point. I was designing software for interactive CDs, for websites, and what have you. I ultimately left Japan, came back to the United States, was in the Bay for a little bit, commuting between San Francisco and Tokyo, then moved back home to start a company founded by myself, with investment from the folks that I worked for in Japan. So I started an S corp in Michigan and built this product called [Inaudible 02:32], and [Inaudible 02:33] in Japanese just means the Festival of the Gods. [Inaudible 02:37] was, for all intents and purposes, another social networking service.
But this was the age of GeoCities and Yahoo, and social networking services at scale were not really a thing, so to speak. And so our idea, which we thought was novel, was to build an interactive, full media experience where you could have communities and create communities. My co-workers, the people I hired, and the people who invested all met on IRC. And so we had this image of people just being able to have real-time experiences and real-time chat in our minds as we were building this. And it was a pretty fun experience, as you might imagine, running a startup and being sort of the lead technical person. I was way too young to be a CTO of any shape, but I effectively was that. And we showed that off at Macworld and got an investor interested in actually buying the IP. That person bought out the company. I made a little bit of money, bought a house, then went through a couple of failed startups, also in this domain of interactive media, interactive marketing. I ultimately ended up at the Volkswagen Group. I spent nine years at the Volkswagen Group, working initially in marketing for the Volkswagen brand, and then transitioning more into technology. I traveled all around the world for Volkswagen — mostly all around Germany, and then Latin America. I moved to Puebla, Mexico for a year, and then to Herndon, Virginia for almost two years, before I left Volkswagen and joined Nissan — again at an advertising agency captive inside of Nissan, in Franklin, Tennessee. I lived in Nashville, commuted to Franklin, and worked on the Infiniti brand. I did that for six whole months before I got a ping from this little video game company in the Pacific Northwest called Nintendo — your listeners probably have never heard of it. I went to Nintendo working as an engineering manager for the consumer online and publishing team, as it became known, working on payment services, account services, and developer support. So if you've ever used a Wii U, 3DS, or a Nintendo Switch, and you've had an NNID, or Nintendo Network identifier, that was the service that my team built and that I helped create; we were also responsible for payments within the eShop, and so on. I left Nintendo, joined F5, spent 13 pretty fun months at F5 building a brand new set of teams there, when I got a DM on my LinkedIn from the Seattle Site Lead for Stripe, who said I should come over for lunch. That was two and a half years ago. I went over on a Taco Tuesday, and I never left — I've been a part of the Traffic team, and became the leader of the Traffic team, for my entire tenure at Stripe, two and a half years now.

Utsav Shah: Okay, how big was Stripe? I have, first of all, so many questions, especially around your experience with Nintendo. But I'm going to ask: how big was Stripe when you joined it? And how big do you think it is now, just approximately?

Tramale Turner: Yeah, when I joined Stripe, it was approximately 1,000 people. And it is approximately four times that size now, and growing so much.

Utsav Shah: You've grown so much over the last few years.

Tramale Turner: Yeah, especially the last 18 months, I would say, we've experienced phenomenal growth. I can't say specifically how many people are there now, but as I said, about a four-times increase — larger than that, actually, and growing rapidly.

Utsav Shah: Did your experience at Nintendo, especially working on the payments platform, make you recognize how important the problem is?
Or how hard and complex the problem that Stripe is solving is? Or was it a combination of things?

Tramale Turner: I love this question. So when I got that invitation to lunch, I of course knew what Stripe was. I follow patio11 on Twitter and had seen his posts, and Patrick Collison's posts on Hacker News, but I pretty much just saw Stripe, in my mind, as a payment services provider, or what we colloquially call a PSP. They have a payment gateway, they connect you to credit card acquirers — no big deal, you can accept credit card payments online. And I had dealt with Chase Paymentech intimately while at Nintendo, so yeah, I was very familiar with how the structure of those agreements worked, and with both the utility and some of the failure modes and fault domains that exist when dealing with a payment gateway. And that's what I walked through the door thinking I was going to experience: someone was going to talk to me about joining this thing that's going to, in their words, I'm sure, change everything or transform it. But what I found out was something completely different, and it's super interesting. The person I was having lunch with that day, a gentleman named Brian Delahunty, was talking to me about Lyft, who is a user of Stripe, and talking me through all of the different use cases. And I won't belabor the point — all of the various products and services that Lyft makes use of — but as he was talking, a light went on, and I started to see: "Oh, it's not just a payment services provider. This is a company that's trying to democratize access to economic enablement. And it's not trying to do it just for a certain segment of the population, like startups or developers or large enterprises. It's for everyone — literally anyone that has access to a connection and can get access to the API," which, again, if you have a computer or a phone and have connectivity, you can do. You can then create a business. And what really occurred to me in that moment was: you could build something that could support putting food on your table every night, without a whole lot of effort. And when I got that concept, when it started to make sense to me, I said to myself, I couldn't imagine not being a part of it.

Utsav Shah: Wow, that's a great story. And I think there's also a lot of complexity there in something as simple as a Lyft transaction, because the person driving the Lyft might be from some other country, the company Lyft is incorporated in some other country, and there are all of the rules and regulations surrounding all of that, making sure you have enough capital — it just sounds like such an interesting technical problem, plus you're helping the world.

Tramale Turner: Yeah, I mean, to your point, right — because it's important, let's just quickly break it down. You as a platform provider, who has perhaps these 1099 (or whatever the regulation is within your country) independent contractors who are driving on your behalf: you want to be able to accept funds somehow, right? So you need the scaffolding so that people can pay you money, so you can get some remunerative benefit from the service that you're providing. You want to be able to identify those people, to your point — you don't know who that person is when they say they want to be a Lyft driver, an Uber driver, or a Gojek driver. So that identity question is super important as well. You want to be able to manage the platform, right?
So being able to do the core service that you enable, in Lyft’s case, being able to enable rideshare. But beyond the core primitives for enabling rideshare, you also have to worry about the safe movement of money. And oh, if you're operating in different countries, what about foreign exchange, right? And then you have to pay those people. And you want to make sure that you have enough money in whatever account that you're managing to do that. And those people who get paid, maybe they want to do that instantly, maybe you have a payment method by which they can have whatever money they've made in the day, immediately transferred to that payment method, maybe a debit card, right? And then they can go, as I said, to their local bodega or to their grocery store and buy the groceries that they need to put food on their table that night, right. And so when I saw that, in real time, and then it started to make sense to me, I started to see Stripe for so much more than what I think most people perceived it to be. And I think the story is starting to, you know, speak volumes for itself these days but there's still a lot more that we can teach the world about what we have to offer. And I'm really excited about the opportunity to do that going forward.

Utsav Shah: Yeah, I think the mission, increasing the GDP of the internet, it's so simple, but it makes so much sense. Yeah, so I've seen some of the numbers that Patrick has posted about on Twitter, and the API volume of Stripe, it’s charges, right? It's not going to be like the QPS to a free service or anything like that. So I've seen it roughly translate to like five to six thousand QPS. That's the math I've done from whatever numbers he shared. And I'm sure it's maybe twice or thrice of that. But it doesn't seem like an inherently super high traffic service. And maybe you can correct me if I'm wrong, and I know you might not be able to share everything publicly but at that point, when you joined the team, why was there a need for a Traffic team? What was the goal of the Traffic team? And how has that goal expanded over time?

Tramale Turner: Yeah, I think it's a really interesting question. You're wrong about the number of QPS, but that's okay. It's okay to be wrong. I won't tell you specifically what that specific metric is but what I will say, which I think is both something intuitive, but also important to consider, is that when you're thinking about user attachment to a service, typically you're thinking about, I mean, in this modern age, we think about hyperscale services. So we think about Facebook, or we think about Netflix, we think about users who are engaging at volume, or at scale, as we like saying in the industry, with consumer-oriented, sometimes skewing towards entertainment services, right. And so making sure that you have cached content as close as possible to that user, think about the Netflix core team, for instance, and the work that they do to make sure that the CDNs and the edge of the network are performant and robust and resilient. Those edge network teams that they have, and they do have multiple teams, they sort of make sense, again, intuitively. But the other thing that you would consider is that if you have to do a retry, or if there is a network partition or packet loss of some shape or form, that, you know, you kind of get for free with TCP/IP's guarantee of delivery, right?
So you can retry again; TCP/IP, layer seven will sort of take care of some of the nuance of what does retry actually look like or what does effective and efficient packet delivery look like. And so you can sort of fake it and you can buffer streams or you can sort of jitter delivery of content. And all of that makes sense again, for most folks, I think, and certainly, people who are listening to this podcast are very familiar with the vagaries of content delivery and building things like CDNs. But when you start talking about money, it becomes a completely different game. So it's not so simple to just retry a request, because you may inadvertently double charge someone. That's a really bad thing. Or you may inadvertently double pay someone, also really bad. Really bad for the user, really bad for the organization providing the service, and potentially in violation of regulatory restrictions or regulations that you have to comport to in order to continue doing your business. So when you ask why, or when, a “Traffic team” became an integral part of Stripe’s engineering organization: from the very beginning. Absolutely important to understand what happens at the edge of the network, terminating TLS because, of course, everything is TLS encrypted and protected, making sure that whatever was in that payload goes saliently to its destination, and that the response goes saliently back to the requester. So that's sort of table stakes. But you also want to be quite careful because you're collecting credit card numbers. And so most payment services providers have something called a cardholder data environment. That's where all of those PANs, or the primary account numbers, the credit card numbers that you have, sort of sit at rest. In most cases, at least in businesses that are doing it in a PCI-compliant way, which is hopefully all of them, you never want anyone who doesn't need to see that primary account number to have access to it. And so how do you do all of the work that you need to do to make the credit card “part of the business” work? You send forward probably some tokenized representation of that credit card number. And that's what the majority of the business deals with, only that tokenized representation. And then they communicate back to the cardholder data environment and the cardholder data environment will communicate with the acquiring partner, and the banks, to make sure that there are funds available, and that they can commit the charge. And that is part of what the Traffic team is responsible for, amongst many other things at Stripe.

Utsav Shah: Okay, so that's super interesting. Let me just clarify this. So there is some service or something at the edge, which takes the actual credit card information, does some kind of hash or tokenization of it, and then the rest of the service [in the factory 17:27] at Stripe never has to worry about potentially dealing with PCI because they will have to at some level, but they don't have to worry about actually holding the credit card information because your edge service has taken care of that. And there's only one data store somewhere that knows how to map from your token to an actual credit card number. Roughly, does that sound accurate?

Tramale Turner: Yeah. So the only thing I would correct there is that there is a service that understands the semantics of translating between an actual credit card number and a tokenized credit card number.
But as for where those credit card numbers rest, let's just say, without going too much into specifics, that there is high resiliency and robustness of persistence to make sure that, one, there is as little latency as possible for the transaction. Because we want the user, the user being the partner of Stripe, the person that signed up to Stripe, the organization that signed up to Stripe, and their customer, to have a really good experience. We want, wherever they are on the globe, for them to imagine that Stripe might have been founded in their country because it's so fast and because it's so effective in closing those transactions. And so how we persist, and how we translate from raw PAN to tokenized PAN, is a really fundamentally interesting distributed systems problem that I think we're pretty darn good at executing against. But yeah, effectively, your summary is broadly correct with, as I said, a couple of nuances that I would correct.

Utsav Shah: Okay, that makes sense. And I've also seen, speaking to the other problem you were talking about with requests not needing to be doubled, it's really bad to retry a request without thinking too hard about it, because you don't want to double charge people, or you don't want to double pay someone. I've seen something interesting in the Stripe API Docs about idempotency keys, where you let users specify a key. And I'm guessing what that means is some service at Stripe maintains a database of every single request that comes in. And from the API Docs, it looks like y’all garbage collect after like 24 hours or something. So you need to store every single request that comes in, if it has an idempotency key, and that's how you make sure not to retry things.

Tramale Turner: That's exactly right. You got it.

Utsav Shah: Yeah. And is that something that your team does?

Tramale Turner: We support the effective routing of those API calls that are idempotent. But there are other teams that are actually responsible for making sure that the integrity of a call that has an idempotency key attached to it fulfills the service requests, and to make sure that whatever mutation occurs only occurs once.

Utsav Shah: Okay, so what else am I missing? Is there anything else that is interesting like that edge service you're talking about, which you have to think about, as--? Let's say that you are pitching to an engineer about the technical challenges that your team has to face. What can you talk about publicly?

Tramale Turner: Yeah, so what I like to talk about is the fact that regardless of what any candidate has heard about an edge team in other organizations, and there are, I think, to your earlier point… well actually, I don't know that we covered this point, but there are edge teams sprinkled throughout the industry. I brought up Netflix, but there's the GFE team, the Google Front End team at Google; Amazon has a similarly shaped team that works on edge primitives. All of the hyperscalers do. Many of the scaled, if you will, used-to-be startups that are now big companies, Spotify, Shopify, you name it, they probably all have something similar. Even Lyft has an edge team. Typically, those teams are very small, because they have a very targeted and very specific set of services. Many of them that I've experienced work on things like, you know, maybe there's an Envoy sidecar proxy at the edge; they want to make sure they're delivering service requests to some service mesh effectively and efficiently.
And they worry only about, again, that initial TLS termination and making sure that whatever they have instantiated at the edge is robust, resilient, scalable, etc. If a team goes deeper, like I think the Netflix team does, maybe there are several teams within the edge construct that worry about CDNs, worry about API gateways, and worry about also just making sure that there's effective service to service communications, and maybe they worry about some other core distributed systems primitives like leadership election, and what have you. Our team does all of those things. And in addition to the sort of core network features, functions, and primitives that we proffer to the organization, to make sure that the API is highly available and resilient, we also worry about this notion of compliance. And why is that? So I mentioned that we're responsible for the cardholder data environment, and the cardholder data environment is the thing that makes Stripe PCI compliant. And PCI DSS is a compliance regime that is, in its current state, a consortium of acquirers and a lot of folks that care about the integrity and the security of folks’ credit card information, making sure that as you're building services online, offline, or the amalgam of the two, you're doing so safely. Because what no one wants to ever have happen is that you're on the front page of the news because you leaked credit card numbers, or someone was able to break into a system and exploit those numbers and then trade them on the dark web or some horrible story like that. So every organization that deals with credit cards has to have some level of PCI compliance, be that online or offline. And we are really good at the PCI compliance motion at Stripe, I would say, fairly exceptional. And the team that looks after a lot of that work, from an engineering perspective, is the Traffic team. That's right. So because we got really good at that, the organization sort of looked at us and said, “Would you be interested in thinking about how you can help remove the toil of a lot of these other compliance and regulatory concerns we have?” So SOC 1, SOC 2, future things that we might be thinking about, that I can't talk too directly about here. But as an example, there are things like FINRA, HIPAA, FedRAMP, all types of regulatory concerns. And what's interesting about those regulatory concerns is that they all require some level of evidence to show that you are compliant with the regulation. And that evidence collection motion, when you start to zoom out and look at it, a lot of the things that are being asked are quite similar. And so as engineers, you look at that and say, “Oh, well, these are declarative statements. And when we have declarative statements, that means we can probably automate the thing. We can come up with a state machine or we can come up with some system by which we are automating the collection of this evidence and maybe even automating the reporting of our attestation of the correctness of that evidence.” And so that's one of the things that the Traffic team at Stripe also looks after. And then we have yet another thing that we're working on that is kind of in the shape of an infrastructure primitive as a product. And I can't talk too much about that, but I will say that it is not completely orthogonal to that notion of compliance and that notion of just being very assertive about protecting the integrity of information that one might share with one's customers as an organization, as a business.
And so I'm hopeful that we'll be talking about that in less than a year, I hope, depending on the velocity by which we build these primitives, but it's something I'm super excited about. And I really can't wait for Stripe’s users to hear about it.

Utsav Shah: Cool. And I don't know if we've ended up talking about what I've been working on since we spoke last: I've been working for a compliance automation company, so all of this sounds super exciting to me. And it's interesting that the Traffic team has ended up in charge. But I guess that somewhat makes sense as well, given that y'all are the team that has to worry about making sure your integrations, in a sense, with the rest of the company are compliant. I can also totally imagine that… how do I frame this? Any sort of regulatory issue should be caught at the earliest layer possible, and that could end up being the Traffic team. You can imagine if there is a payment that is being made, again, in your Lyft example, in a way that isn't compliant, you don't want to find out right at the end. You want to find out right at the beginning. I don't know if that makes sense.

Tramale Turner: No, it totally makes sense to me. And I think yes, and… I think that when we think about compliance, one of the things that a lot of organizations in my experience tend to do is, well, everyone in the organization sort of steps away from compliance, and they're like, “Oh, that belongs to the regulatory--" Like, maybe there's literally a compliance organization typically reporting to the CFO or to the chief legal officer, and the engineers in the organization sort of sigh and they're like, “I don't want to deal with this toilsome evidence collection process. It's just super disruptive. And I understand it's necessary, but it's not something that gives me joy.” And what I love about that is that it's boring. Boring, but mandatory for everyone who participates in this ecosystem, and not just for payments, as I mentioned; like, if you're doing health care information, you're dealing with HIPAA, if you're trying to sell to the federal government, you're dealing with FedRAMP, and so on, and so forth. So I mean, you know this because you're working on a startup that has seen the opportunity, so much opportunity in this space. And what I'm working on is just making sure that we're really good for Stripe internally, with all the compliance and regulatory motions that we have to comport to, but I totally see, totally see the opportunity for platforms and products. In fact, we know this to be already something that there are many companies that have tooling and/or platforms and services that support. So GRC tooling, which people may be familiar with, which is governance, risk, and compliance, there are a ton of vendors. ServiceNow is one of the biggest SaaS vendors, for instance, that has a GRC tool where evidence is supposed to rest and then you can use that resting state of evidence to support these different regulatory needs and concerns. So there's opportunity here, and I'm excited to just have a team that is really well versed in sort of the complexities of compliance, as well as having the experience to know how to build, let's say, primitives, then services and platforms to help accelerate getting a lot of that work done. That very necessary, but maybe very boring, work done.
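Stepping back to the idempotency-key exchange earlier in the conversation, the pattern is simple enough to sketch. Everything below is hypothetical: the in-memory store, the handler name, and the 24-hour TTL are illustrative stand-ins inspired by the public API docs, not Stripe's actual implementation.

import time

# Hypothetical store; a real system would use a replicated database with an
# atomic check-and-set, since a plain dict ignores the race where two
# retries of the same key arrive at the same instant.
_RESULTS = {}  # idempotency_key -> (response, stored_at_seconds)
TTL_SECONDS = 24 * 60 * 60  # mirrors the roughly 24-hour GC mentioned above

def handle_charge(idempotency_key, create_charge, now=time.time):
    """Run the mutation at most once per key; replay the saved response on retry."""
    entry = _RESULTS.get(idempotency_key)
    if entry is not None and now() - entry[1] < TTL_SECONDS:
        return entry[0]  # a retried request gets the original response back
    response = create_charge()  # the single side-effecting call
    _RESULTS[idempotency_key] = (response, now())
    return response

A client that times out can then resend the same request with the same key and be sure the charge executed exactly once, which is why naive retries are fine for content delivery but not for money.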
Utsav Shah: So in terms of goals, would it be success if no one in the rest of the organization had to care about compliance and it all got kind of platformized by your team? Or would it be success if people still have to do it, but it's super seamless for them? And maybe if you could just talk concretely, like how do you enable compliance when you have like a 4,000 person company that has to think about it pretty much because it's bread and butter for your business?

Tramale Turner: Yeah, I think that's an excellent question. I think success, for me, looks like being very rigorous around understanding what the toil cost of all of the compliance motions that we invest in today is. Again, these are stay-in-business motions, they're not things that are optional. They're not things that you can choose not to do. If you want to stay in business and you want to continue doing business in certain markets or within certain industries or proffering certain services, you must do these things. And so it's a really easy equation to look and see, “Okay, how many people are committed to--" You know, let's constrain it down. Let's just talk about PCI DSS. “How many people are committed every year to collecting evidence, cleaning evidence, sitting with an internal auditor, sitting with a QSA, and making sure that everything that we're supposed to be doing to secure this data, this credit card data that we're holding, is correct.” That's literally the question. Is it correct? And you can look at all of the effort that you expend against that, and then start to see, “Okay, what parts of this are redundant? What parts of this are things that, to the earlier point, could be collapsed into declarative statements that we could tell a piece of software to go and do?” So a really simple, just very basic version of an example of that is, one of the things you want to be able to see is access logs that tell you when a persistent store that has sensitive data was accessed. You could easily have an operator go to a system, type a set of commands, copy those commands, show the date, and show that the operator is pulling that information to show those logs of when something was accessed and/or modified. You could also ask a computer to do that. And the computer could probably do it more effectively, more efficiently. And with that saliency of automation, you can then start to see, “Okay, what is the time we just saved from having the operator do that?” So those are sort of like the basic conversations and the really easy starter primitives that one would want to start building around, like the automation of evidence collection. Where I think it goes further is, “Okay, well, can you also automate the reporting piece as well? Can you take all of that data that you've collected, if you're collecting it regularly, and just on-demand generate the attestation of compliance?” I think you can. I think that that's actually something quite feasible. And if you get to that point, and you've solved it for PCI DSS, can you solve it for PCI P2PE? Can you solve it for FedRAMP? Can you solve it for FINRA, etc., etc.? And my conjecture at this moment - conjecture, because I'm still proving it - is that you can do that, and I am solving that problem internally for Stripe as part of the infrastructure organization. And I am hopeful, because we all tend to be quite ambitious at Stripe, that that's something that, if we get really good at it, who knows? Who knows what we might do with it going forward as a potential service offering?
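The access-log example above lends itself to a small sketch. The control ID, the log source, and the record shape below are all invented for illustration; the point is only that a declarative control can be executed and timestamped by a machine instead of an operator.

import datetime

# A hypothetical declarative control, loosely in the shape of a PCI DSS
# logging requirement (the ID and wording are made up for this sketch).
CONTROL = {
    "id": "pci-10.2-access-logging",
    "description": "Access to cardholder data stores is logged",
}

def collect_evidence(fetch_access_logs):
    """Execute one control and emit a timestamped evidence record."""
    logs = fetch_access_logs()  # supplied by whatever logging system you run
    return {
        "control": CONTROL["id"],
        "collected_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "passed": len(logs) > 0,  # real checks assert far more than presence
        "sample": logs[:5],  # keep a small sample for the auditor
    }

Records like this, collected continuously, are what would let a team generate an attestation report on demand rather than assembling it by hand once a year.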
Utsav Shah: That makes a lot of sense. And you can imagine, I think everybody that needs to maintain credit card information needs to be PCI compliant. So offering that as “You know what? If you have Stripe, you can also get reports that prove that you've been compliant with the regs.” I don't know how feasible that is because I don't have any familiarity with PCI, but it seems like that is certainly an approach that is exciting. It's also scary because you'd be a competitor of the company [Inaudible 34:30] and I don’t want that.

Tramale Turner: Well, I must say, just to be clear, for users of Stripe, we do handle PCI compliance on their behalf, but there are different levels of PCI compliance and there are different types of PCI compliance. I mentioned P2PE, which allows you to do things like tap on a phone or use the NFC chip within your smartphone to have it act as a point of sale device. I think you're going to see different types of online or near-online payments, what we traditionally call offline, but they are offline only inasmuch as there's a person who is present; everything that's happening after the payment method is made available to the retailer is something that's happening in the digital space, happening online. So I think as you see the proliferation of those things, maybe not so much in North America, because North America is just strange in the world. But certainly, globally, there are all kinds of new payment methods coming up every single day. You're going to also see governments and regulatory bodies saying, “Okay, we kind of want to make sure that we have a handle on what's happening here because we want to protect our consumers, we want to protect our citizens, in many cases, to make sure that they're not being taken advantage of or that we're not allowing the proliferation of crime or criminal enterprises by virtue of these new payment methods that are coming online.” A great example of that is what we've seen with digital currency, or crypto, recently, and its enablement of some crime vectors that clearly are not optimal. And regulatory bodies are trying to figure out in the present moment, in real-time, right now, how to put some controls around some of those criminal enterprise vectors.

Utsav Shah: Yeah. And as part of maybe a follow up to that there's also-- All right, this is from my conversation with Emma at Stripe recently, have you noticed, or is your team also thinking about data governance stuff - credit card information of a particular country should only live in that country - because it seems like there's more and more laws around that? Is that kind of stuff something that you have to think about?

Tramale Turner: Absolutely. So it's already public knowledge because I think there were several articles published about it, but one very deliberate way that we're making investments in that space-- And just to sort of add clarity, for folks who may not be familiar, there is a notion of data residency, data locality, or data sovereignty. And what does that mean?
That means that a regulatory and/or political body might say, “For users who are using online services to make payments and to buy goods and services, who reside within our municipality, our country, whatever domain that we control, we would want their data to remain within the borders of the country.” Or if there is a processing function, some type of compute that's occurring, where you're capturing that payment method and doing something to remove or add funds to it, that processing must happen only within the borders that they control. And they may say, at least on the surface, that they're doing that, indeed, to protect the user from potential criminal activity, to be able to easily have access to those services should there be some nefarious activity, and they need to collect evidence from those services much more saliently or easily. And then there, if you go deeper, is something of the notion of protectionism. So there may be other competitors within that country that are trying to compete with more globally established players, and the country or the regulatory body may want to support their own natively grown industries in order to allow them to scale and/or to provide, maybe to some definition of better, “better” services to their citizenry. So that's what we're talking about. And Stripe recently - and I should caveat that recently means on the order of two and a half years - has been thinking about this for India. So India passed regulations that basically said that any payment service provider that's operating within the country has to keep any Indian citizen’s data that is used against those services resident within the country. Simple as that. So what does that really mean concretely? If I am shopping online with my credit card, I'm in India and I'm an Indian citizen, or I'm using RuPay or I'm using some other payment method that is either global or local to India, any information pertaining to that payment method, if it initiates in India, must stay in India. And there are some caveats, like you have a little bit of grace to process outside of the country, but any data that is processed can't persist outside of the country for more than 24 hours, and things like that. And so if you've built, for instance, on a hyperscaler, and let's say all of your services are in US-East-1, if you're using AWS, that presents an issue, right, because you now need to figure out how to move those services regionally. And then not to degrade the services of any of your users who are not in that country, but also for the users in that country, make it somewhat seamless and make it also seem as though all of those services that you proffer to everyone else are available wherever those sort of more constrained borders might be. And so that's a tough problem. It's not just a tough problem. It's actually a really difficult problem. But Stripe recently announced that we solved that problem for India. And if you're watching the FinTech news, as I think many folks are these days, you'll see that we'll be making future announcements about other locations that we've solved it for in the coming months. And it's something that we have been investing a lot of time and rigorous effort in making sure that we do correctly. Again, not just for the countries that have data sovereignty rules, either pending or already in existence, but also to make sure that we don't degrade services for the rest of the world, modulo those countries. So I really love this topic because I worked on India.
And I'm really happy with not only how we addressed the issue, but the many excellent engineers that just dove deeply into the problem, and rigorously worked to effect a really nontrivial change to, I'm hoping, the lives of our users in India.

Utsav Shah: That just sounds so amazing. Stripe sounds to me like a compliance state machine where you have to continuously inject laws and changes in laws and make the system work. I don't know if you could talk at all about the actual implementation. So I know, at a very basic level, you have to run some, or a significant percentage, of services in India. And that makes sense. I don't know if you can talk about anything else, or where the complexity of the implementation comes in.

Tramale Turner: Yeah, I can't talk about specifics, but I can talk about what the-- I think any reasonable person, and certainly anyone that's dealt with distributed systems and understands convergence and understands the, I think, misnamed eventual consistency - it’s eventual convergence really - I think the thing that makes all of this difficult is state, at the end of the day: where does data rest, and making sure that you can attest that there is a compliant resting state of that data, per whatever the regulation says it needs to be. I mean, my friends who focus on the computing primitives will throw stones at me, I'm sure, but I actually see compute, and actually my own area, networking, as much easier. It's not that difficult for me to come up with smart route maps and tag a request with sort of a locality primitive and say that, “Hey, I want you to direct this traffic only to these regions.” I mean, we've been doing that sort of, if you will, network directive and topological decisioning forever, right? Like since the inception of the internet, more or less. That's not a true statement. But it's not that difficult. But dealing with making sure that data that you-- I mean, remember, these are corporations and corporations need to be able to close their books and understand how much revenue they've generated and understand, if there are nefarious people trying to attack the system, how they're trying to attack the system. And you need reams and reams of data. And you need to be able to go through all of your data and understand what that data is telling you about what's happening within the system to get state about the system, which is why we call them stateful systems, right? That is a majorly, huge, huge, huge part of the issue. That's the big rock. And we at Stripe have spent a lot of time, I think, bending our core persistence primitives to a point where-- I don't know how every sort of storage problem within our industry looks in comparison. But I would say there are probably very few organizations that, with the tools and services that we make use of, are pushing them to their utility limits as we are. I would argue that probably only the hyperscalers are doing the type of computer science, the type of distributed systems focus, that we are investing in in order to comply with these regulations. And I know for a fact that it's a big deal at AWS. I'm here in Seattle and so I have a lot of friends that work at that particular hyperscaler. And I would imagine that my friends down the street at Google, also, similarly, are dealing with how best to comport and comply with these laws and regulations.

Utsav Shah: Yeah, and maybe taking a step back from compliance, what compliance really [Inaudible 46:18] a bunch of things to make.
What compliance means is, “Are you complying with our set of laws?” But the laws are there for a reason. They're not there just to impede progress, for no reason. But a lot of compliance is around making sure that you're keeping your data secure, and also having processes around that. So the flipside of compliance is really security. And I'm trying to understand if there's anything you can talk about with regards to the kind of security practice and security measures and maybe security implementations y’all have to think about, because I'm sure you're dealing with tons of actors trying to steal data from your systems by just fuzzing the API, like “Maybe I can just get access to something that I'm not supposed to access.” How do you think about that? And how much is your team responsible for that?

Tramale Turner: Yeah, it's a good question. You keep asking me these great questions that I can't actually give you direct answers to. Let me put it this way. We're dealing with money. And money, as we all know, creates an attractive attack vector for assailants. And I think, not just criminals who are trying to perhaps find a way to exploit the system for some remunerative benefit, but also people who would just want to deny others access to that democratization of economic freedom and enablement that I spoke to at the beginning, the thing that actually gets me up and excited about Stripe every morning. And so what I will say is that Stripe has already, and is continuing to, invest in one of the best and most rigorous security teams and individuals building platforms and services to protect the integrity of our business that I've ever had the pleasure of working with. The leader of security at Stripe, Niels Provos, was a long-serving Googler, was a Google Vice President of Security, and left Google to come to Stripe. And Niels has been building, I think, an amazing team of practitioners, that every time I get an opportunity to interface with them, and I do so frequently, I'm thoroughly impressed. And I actually consider my team to be an extension in some ways of the security team. Security is a different pillar at Stripe than infrastructure, but clearly, they conjoin and clearly, they're intersectional. And I am one of those points of intersection because I support this CDE, this Tier 0 service that is incredibly a lot of things. It's incredibly secure, for sure, but also incredibly important to the viability of Stripe. Stripe doesn't exist without the CDE, so to speak. So one thing that I can talk about that we do, that I think would be pretty obvious to anyone that deals with hyperscaled or highly scaled services, is that we are very concerned about what's happening at the edge of the network. To your point, who's sort of poking at the API? And what sort of things are they doing, at layer three, four, or seven, to try to either deny access to the API or to try to do strange things that the API doesn't expect? So you mentioned fuzzing but there are all types. If you were to look at request logs of Stripe, you would see all types of attempts at just doing strange escape vectors, trying to manipulate URL paths, using really odd URI access methods. It would not surprise, I think, most people, but would also really, really freak you out if you saw how frequently and how often folks are trying to attack just this one organization. And you think about all of the payment service providers that are out there.
I am hopeful that they all have security teams, and engineering teams more broadly, that are just as rigorous around security as we are. But again, to the point, this notion of denial of service is a big deal, we all know that. And so we make use of certain AWS primitives like AWS Shield to protect at layers three and four, but for layer seven, we have sort of our own; it's not necessarily a WAF, a web application firewall, but it is a platform, a tool kit that we have to address different attack vectors that we have both seen and that we anticipate and expect against the API. And we insert primitives into that platform to allow us to do things like throttling or to deny API keys that maybe have leaked for whatever reason. Or if we see card testing, which is something that frequently happens within the industry - trying to test the card to see that it's usable and then to steal funds from that card once you find out that it is usable - we can identify that type of activity and then block it immediately. So that's just a sampling. But trust me, if I were able to, I could probably talk for hours about many of the different types of security mitigations, remediations, and defense-in-depth things that we invest in at Stripe.

Utsav Shah: Yeah, so you're talking about web application firewalls. And I think, publicly, I've seen just systems that block based on IP addresses or certain URL patterns. And what you're describing, at least what it sounds like to me, is a whole platform of being able to specify all of these different rules or these different criteria, like, “Oh, looks like an API token for someone has gotten leaked, or there's just suspicious activity,” and being able to automatically disable those things. And that makes a lot of sense. And also, it makes sense that you don't want to be DDoSed, since you have a lot of people depending on you for their business. And I'm sure when Stripe goes down, a lot of companies are upset and are calling you up.

Tramale Turner: Indeed, we try not to go down. And I would argue we're pretty good at that. Our availability is staggering when you look at it, five nines. And I don't want to overspeak here, but I think just mathematically, for the API to be available at five nines, you can imagine what we try to achieve for the network, at least for minutely availability. Just to be clear, I didn't state what that SLI is, so no one at Stripe come at me. I didn't reveal that publicly. But again, we have very high, great expectations, as it were. I've been quoting a lot of Dickens at work recently because wherever I look, and whenever I see a plan, or whenever I see an ambition at Stripe, and I see how it implicates my team, I go, “Wow, we've got a lot of great expectations.”

Utsav Shah: Yeah, maybe the last set of questions are around building teams to service these lofty goals. When you have a small team, and I'm just talking about like 50 people, 100 people in an entire company, there are a few trusted individuals that you can think about, you know, “This person will take charge of these things.” But once your company grows so big, you have to think about just expanding from individuals or even teams at that point. How do you make sure your infrastructure stays at five nines? You know, if there is a lot of attrition on the team, how do you set yourself up for success? And what kind of processes and what kind of people are you looking to hire?
What are the things you're thinking about when you have to think about, you know, “How do I make sure that my APIs stay at five nines for next year?”

Tramale Turner: I love this because there's no easy answer. There's no sort of canonical response that says, “This is what you do.” If there were, everyone would be copying it, and everyone would be doing it. Here's my experience, and then I'll tell you what I specifically do. My experience is that you, first of all, face the reality of what you just said: people are going to churn. At Stripe, for instance, we're huge proponents of internal mobility. And we actually encourage people to move teams after 12 months or so within a role, assuming, you know, decent performance. And I think the good thing about that is that it encourages teams to be very rigorous about their processes, about making sure that they have run books for their services, that they build highly reliable, highly resilient services, that they, if you will, build a moat of protection around faults. The old head of infrastructure at Stripe-- I should say the former head of infrastructure, because he’s not that old of a gent, is Will Larson, lethain all over the internet. And Will wrote this book recently, An Elegant Puzzle, under Stripe Press. And I encourage folks to read it. I'll make fun of Will here: I find Will's writing incredibly dense, and you'll probably have to read the book three times. But it's a good book. And Will talks a lot about organizational structures and dealing with the vagaries of what will happen within an infrastructure as it intersects with those organizational structures. So you're talking about how technical complexity meets organizational complexity. And the thing that you have to hold as a base principle is that you don't know anything. It is the sort of Socratic paradox, “I know that I don't know anything.” And for all of your experience, and for all of the things that you've done, you're not going to build the perfect, most reliable, never-faulting system. So how do you get around that? You prepare for eventual failure modes, you prepare for understanding what those fault domains look like. AWS could go down. An AWS region could fail tomorrow. What do you do when that happens? And how do you recover from that type of epic disaster? Do you even have a plan? Do you have a run book? Have you done a dry run? Have you done a game day, or whatever your company might call it? Have you tried to test your assumptions and validate whatever mitigations that you have in place? And if you haven't, you're doing it wrong. And so what I do is I try to hire the best. Absolutely. I look for the best possible folks that I can and I never sort of rest on the laurels. I have a ton of great folks that work in Traffic; I'm still looking for greater folks because I want to constantly challenge the notion that we have solved for the problem set, the problem domain that we are responsible for. I listen - I try to - as much as I talk, and hopefully more so than I talk, because it's really about learning about the environment, learning about experiences and learning about what people are observing, and then synthesizing a sort of mental model of what we should be doing and what we should be investing in, in order to address these risks. I test my assumptions rigorously. So we do game days; we make sure that if we synthetically fail something, we know how we're mitigating against that eventual occurrence, and so on, and so forth.
So I think “proper planning prevents poor performance”, which is sort of something you steal from the MBAs, is something that really is relevant to engineering and engineering disciplines as well. If you're not planning, then you're not doing it right.

Utsav Shah: That makes a lot of sense to me. And the one question that comes from there is, how do you balance what's practical versus what should be correct on principle? A simple example is, you should make sure that your database backups work. I think almost no one would argue that that's something you should do. But then there are so many things that could go wrong. And thinking about the AWS example, I know a lot of companies would be like, “Oh, if AWS goes down, it's fine if I go down as well.” I'm sure that's not true at larger companies and companies like Stripe, but there's also a prioritization game that you have to play. So is there any framework you think through there, like, “How do I prioritize?” Is it just based on how much risk something has towards my business versus what's the likelihood of this happening? What else is there that you think about?

Tramale Turner: Well, I mean, when you think about it from that perspective, it's almost like thinking about-- We were talking a lot about security today. And when you talk about security mitigations, you talk about threat models, and you talk about what's the probability of something occurring? And what's interesting about probability is that you're not saying-- I mean, it's very rare that you're saying the probability is zero, right? You're going to have some percentage of likelihood of occurrence. And so the probability of an AWS region going down is very, very low. But it has happened; it happened in 2012. We all remember the big EBS outage that took out US-East completely. It wasn't just US-East-1, it was like the entire East Coast went down. And it was a horrible day. Amazon learned a ton from that. And that shape of failure mode is not likely to occur again, but the fact remains that a transit zone could go down, some core EBS primitive could go down, bad things can and will happen. And so when you talk about how to be principled around that, it's just acknowledgment of the fact that there's enough entropy in the world that random things, Black Swan events, will happen. And if you are being rigorous about-- You know, I think, to the thing that you're driving towards, there's different levels of rigor and expectations depending upon the type of service that you're delivering. For us, we have to take every eventuality and every possibility quite seriously, again, because of the type of information that we're dealing with. We're dealing with people's livelihoods. We go back to the beginning of our conversation. If I fail to be rigorous about my pursuit and about the things that I'm responsible for, and that person somewhere in Bangalore, or somewhere in Kenya, or somewhere in Dublin, can't put food on their table that evening, that's my fault, I own that failure. And I have to have empathy with that user because I wouldn't want that to be me, and I wouldn't want to have an organization that my livelihood depends upon to not have that type of concern in mind as they're doing what they do every day. And so when we talk about principles, that's the way that I see my principles-- Like of course, Stripe has a canonical set of leadership and operating principles, as does every company of that scale.
But I think even beyond the leadership and operating principles, which are very good, it's just about being human and caring about the fact that these services go beyond just bits and bytes, that they touch people at the end of the day. And so the humanist that’s in me says it's just about caring for my fellow man and woman and making sure that I do everything that I can to ensure that they have a good day.

Utsav Shah: Yeah, that makes sense. And it's very easy to get lost in numbers like four nines or five nines, but when you translate that to the actual human impact, that's what really shows you the importance of the work that you're doing. Yeah, well, this has been a lot of fun. Thank you so much for being a guest. And I hope to have you for a round two, maybe in a few years when you can talk about some of these topics more publicly.

Tramale Turner: All right. I will put that on the calendar. Thanks for having me. It was really fun.
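One technical aside before the next episode: the cardholder data environment described in this conversation is easy to picture with a toy tokenization sketch. This is not Stripe's design, just the general shape of a CDE boundary; the vault here is a plain dict standing in for a hardened, access-audited, PCI-scoped store.

import secrets

_VAULT = {}  # token -> PAN; in reality a locked-down, audited persistence layer

def tokenize(pan: str) -> str:
    """Swap a primary account number for an opaque token at the edge."""
    token = "tok_" + secrets.token_hex(12)  # random, so it reveals nothing about the PAN
    _VAULT[token] = pan
    return token  # only this token flows through the rest of the system

def detokenize(token: str) -> str:
    """Callable only inside the CDE, right before talking to the acquirer."""
    return _VAULT[token]

The reason a random token beats a hash here is that a 16-digit PAN has little entropy, so an unsalted hash can be brute-forced offline; a random token carries no information about the number it stands for.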
 
Rajesh Venkataraman is a Senior Staff Engineer at Google where he works on Privacy and Personalization at Google Pay. He’s had experience building and maintaining search systems for a large part of his career. He worked on natural language processing at Microsoft, the cloud inference team at Google, and released parts of the search infrastructure at Dropbox.

Apple Podcasts | Spotify | Google Podcasts

In this episode, we discuss the nuances and technology behind search systems. We go over search infrastructure - data storage and retrieval, as well as search quality - tokenization, ranking, and more. I was especially curious about how image search and other advanced search systems work internally with constraints for low latency, high search quality, and cost-efficiency.

Highlights
08:00 - Getting started building a search system - where to begin? Some history.
13:30 - Why we should use different hardware for different parts of a high throughput search system
17:00 - What goes on behind the scenes in a search system when it has to incorporate a picture or a PDF? The rise of transformers, not the Optimus Prime kind. We go on to discuss how transformers work at a very high level.
27:00 - The key idea for non-text search is being able to store, index, and search for vectors efficiently. Searches often involve nearest neighbor searches. Indexing involves techniques as simple as only storing the first few bits of each vector dimension in hashmaps.
34:00 - How search systems efficiently rebuild their inverted indices based on changing data; internationalization for search systems; search user interface design and research.
42:00 - How should a student interested in building a search system learn the best practices and techniques to do so?
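The 27:00 highlight, storing only the first few bits of each vector dimension in hashmaps, is simple enough to sketch. The data and parameters below are invented; real systems quantize far more carefully and probe neighboring buckets rather than just one.

import numpy as np
from collections import defaultdict

def coarse_key(vec, bits=2):
    """Keep only the top `bits` bits of each dimension to form a bucket key."""
    levels = 2 ** bits - 1
    q = np.clip((((vec + 1.0) / 2.0) * levels).astype(int), 0, levels)
    return tuple(q.tolist())

# Toy corpus of unit-norm embeddings.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

index = defaultdict(list)
for i, v in enumerate(vectors):
    index[coarse_key(v)].append(i)

def candidates(query):
    """Nearest-neighbor candidates are whatever shares the query's bucket;
    an exact distance computation then re-ranks this much smaller set."""
    return index[coarse_key(query)]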
 
Devdatta Akhawe is the Head of Security at Figma. Previously, he was Director of Security Engineering at Dropbox, where he led multiple teams on product security and abuse prevention.

Apple Podcasts | Spotify | Google Podcasts

In this episode, we discuss security for startups, as well as dive deep into some interesting new developments in the security realm like WebAuthn and BeyondCorp. We wrap things up with slightly philosophical points on the relationship between security and regulation.

Highlights
0:00 - What got Dev interested in computer security?
4:00 - Security for a startup. What framework should a CTO use to think about security as their startup gets its first customer?
7:30 - Trends in the security space. Increasing customer demand for security due to the multi-tenant nature of the cloud. Lateral movement attacks.
12:45 - BeyondCorp. “There’s BeyondCorp, and YOLO NoCorp”. NIST’s paper on it.
25:00 - How should I think about a Bug Bounty program as a startup founder? Having a good “Vulnerability Disclosure Policy” is an extremely valuable first step.
26:30 - Why would anyone report bugs if they weren’t being paid for them?
30:00 - Interesting security products that companies might want to buy :)
34:30 - What is WebAuthn?
39:00 - How security and usability shouldn't be a trade-off
43:00 - Security regulations
47:00 - A repeat question - as a startup, what should I do to keep myself secure?
 
Laurent Ploix is an engineering manager on the Platform Insights team at Spotify. Previously, he was responsible for CI/CD at several Swedish companies, most recently as a Product Manager at Spotify, and a Continuous Integration Manager at Sungard.

Apple Podcasts | Spotify | Google Podcasts

Highlights
05:40 - How CI/CD has evolved from a niche practice to a standard and expected part of the development workflow today
12:00 - The compounding nature of CI requirements
14:00 - Workflow inflection points. At what point do companies need to rethink their engineering workflows as they grow? How that’s affected by the testing pyramid and the “shape” of testing at your particular organization
20:00 - How the developer experience breaks down “at scale” due to bottlenecks, the serial nature of tooling, and the “bystander effect”. Test flakiness.
28:00 - How should an engineering team decide to invest in foundational efforts vs. product work? The idea of technical debt with varying “interest rates”. For example, an old library that needs an upgrade doesn’t impact developers every day, but a flaky test that blocks merging code is much more disruptive
33:00 - The next iteration of CI/CD for companies when they can no longer use a managed platform like CircleCI for various reasons.
40:00 - How should we measure product velocity?
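On the test-flakiness point at 20:00, the usual first mechanical step is to rerun failures at the same commit and treat anything that both fails and passes as flaky rather than as a merge blocker. A minimal sketch, with the runner interface invented for illustration:

def classify_failures(failed_tests, run_test, retries=2):
    """Split failures into flaky (passed on a retry) and genuinely broken."""
    flaky, broken = [], []
    for test in failed_tests:
        if any(run_test(test) for _ in range(retries)):  # run_test -> True on pass
            flaky.append(test)   # quarantine and file a ticket; don't block merges
        else:
            broken.append(test)  # consistent failure: block the merge
    return flaky, broken

Retries can hide real nondeterminism bugs, so quarantined tests still need owners, which connects to the “bystander effect” mentioned in the highlights.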
 
Sujay Jayakar was a Software Engineer at Microsoft Research where he worked on kernel bypass networking. He was previously a Principal Engineer at Dropbox where he worked on the migration of user data from S3 to the internal storage system (Magic Pocket), and the sync engine deployed to clients.

Apple Podcasts | Spotify | Google Podcasts

Highlights
05:00 - What framework do you use to decide to stop using S3 and store data in your own data centers? (the “Epic Exodus” story)
11:00 - Perfect Hashtables and how they are used in production systems to reduce memory use
14:00 - What is an OSD (Object Storage Device)? How does it work?
20:30 - SMR drives
30:00 - The actual data migration - how did it happen, and how does one validate that the data being transferred is correct.
33:00 - “S3 being overwhelmed”. That’s a string of words most software developers don’t expect to hear. What kind of overhead do kernels impose on networking, and why?
43:00 - What is Kernel Bypass Networking?
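For the perfect-hashtable highlight at 11:00: one toy way to build a perfect hash for a fixed key set is to brute-force a salt until every key lands in its own slot, after which values can live in a flat array with no buckets or chains to store, which is where the memory savings come from. A sketch; production constructions such as CHD are far more efficient.

import hashlib

def slot(key, salt, n):
    """Map a key to one of n slots under a given integer salt."""
    h = hashlib.blake2b(key.encode(), salt=salt.to_bytes(8, "little"))
    return int.from_bytes(h.digest()[:8], "little") % n

def find_salt(keys):
    """Search for a salt under which all keys occupy distinct slots."""
    n = len(keys)
    for salt in range(1, 1_000_000):
        if len({slot(k, salt, n) for k in keys}) == n:
            return salt
    raise RuntimeError("no salt found; grow the table or change the hash")

keys = ["block-a", "block-b", "block-c", "block-d"]
salt = find_salt(keys)
table = [None] * len(keys)
for k in keys:
    table[slot(k, salt, len(keys))] = k  # lookups are now a single array index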
 
Colin Chartier is the co-founder and CEO of LayerCI. LayerCI speeds up web developers by providing unique VM-like environments for every commit to a codebase, which enables developers, product managers, QA, and other stakeholders to preview code changes extremely quickly and removes the need to spin up a local environment to showcase demos. This enables interesting workflows like designers signing off on pull requests. Colin was previously the CTO of ParseHub and a software design lecturer at the University of Toronto. The focus of this episode was on developer productivity, management of a CI system and company, and even a little bit of cryptocurrency mining.

Apple Podcasts | Spotify | Google Podcasts

Highlights
0:00 - What does LayerCI solve?
2:00 - CI is generally resource-intensive and slow. What makes LayerCI fast? A lot of similarities to the Android Zygote. We’ve even floated the idea of Python Zygotes at a previous job.
5:00 - The story behind LayerCI.
12:00 - The architecture that serves LayerCI. The cost of nested virtualization and each additional Hypervisor. OVH.
15:00 - Rate limiting. The impact of rising cryptocurrency prices on free tiers of CI providers - read more.
30:00 - The power of building high-quality infrastructure. How both developer tools like LayerCI, as well as low-code/no-code tools like Retool and Zapier are important for the future.
37:00 - Colin’s course for DevOps academy
47:00 - Hiring philosophy for startups
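The Android Zygote comparison at 2:00 can be demonstrated in miniature: pay the expensive startup cost once, then fork a warm process per job, which inherits everything copy-on-write. A POSIX-only sketch with invented names; real zygote systems snapshot whole VMs or containers, not just a process.

import os
import time

time.sleep(1)  # stand-in for importing heavy dependencies and warming caches
warm_state = {"deps": "loaded"}  # the state every forked job inherits for free

def run_job(job_id, state):
    print(f"job {job_id} started instantly with prewarmed state: {state}")

for job_id in range(3):
    pid = os.fork()  # child starts from the warm snapshot, not from scratch
    if pid == 0:
        run_job(job_id, warm_state)
        os._exit(0)  # never fall back into the parent's loop
    os.waitpid(pid, 0)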
 
Naphat Sanguansin was the former TL of the Server Platform SRE and Application Services teams at Dropbox, where he led efforts to improve Dropbox’s availability SLA and set a long-term vision for server development.

This episode is more conversational than regular episodes since I was on the same team as Naphat and we worked on a few initiatives together. We share the story behind the reliability of a large monolith with hundreds of weekly contributors, and the eventual decision to “componentize” the monolith for both reliability and developer productivity that we’ve written about officially here.

This episode serves as a useful contrast to the recent Running in Production episode, where we talk more broadly about the initial serving stack and how that served Dropbox.

Apple Podcasts | Spotify | Google Podcasts

Highlights
1:00 - Why work on reliability?
4:30 - Monoliths vs. Microservices in 2021. The perennial discussion (and false dichotomy)
6:30 - Tackling infrastructural ambiguity
12:00 - Overcoming the fear from legacy systems
22:00 - Bucking the traditional red/green (or whatever color) deployments in emergencies. Pushing the entire site at once so that hot-fixes can be checked in quickly. How to think of deployments from first principles. And the benefits of Envoy.
31:00 - What happens when you forget to jitter your distributed system
34:00 - If the monolith was reliable, why move away from the monolith?
41:00 - The approach that other large monoliths like Facebook, Slack, and Shopify have taken (publicly) is to push many times a day. Why not do that at Dropbox?
52:00 - Why zero-cost migrations are important at larger companies.
56:00 - Setting the right organizational incentives so that teams don’t over-correct for reliability or product velocity.

Transcript

Intro: [00:00] Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening.

Utsav: [00:15] Hey, welcome to another episode of the Software at Scale podcast. Joining me here today is Naphat Sanguansin, who is an old friend of mine from Dropbox and now a Senior Software Engineer at Vise. At Dropbox, we worked on a bunch of things together like developer productivity, and finally on the application services team, where we were in charge of dropbox.com and the Python monolith behind the main website that powers Dropbox. Thanks for joining me, Naphat.

Naphat: [00:47] Yeah, happy to be here. Thanks for having me.

Utsav: [00:49] Yeah, I think this is going to be a really fun episode because we can remember a bunch of things from our previous jobs basically. I want to ask you what got you interested in working on Dropbox, on the main website? So there were a bunch of different things we were doing in the past, and at some point, you transitioned to work on various parts of the site. So what got you interested in that?

Naphat: [01:15] Yeah, that's a good question. There are multiple factors but timing, I think, is probably the most important one here. So that was right when I had just moved to a new country, I moved to Ireland, I had switched teams completely, and I was sort of looking for the next big thing to sink my teeth into. And you’ll remember, at the time, Dropbox was also right at the time that they were trying to move the SLA from three nines to three and a half nines, I believe. And what that actually means is, they can only be down for 20 minutes instead of 40 minutes.
And so that triggered a bunch of reliability problems, like some assumptions we had made before about how quick pushes can be, how much downtime we can get, things like that. They all failed hard, so we needed to unpack everything and figure out how to do this correctly. For historical context, back then, there wasn't-- Let me just back up. So Dropbox is built on a large Python monolith server; we actually have a blog post about this. The way to think about it is to imagine how you would build a server if you had just come out of college. You would probably start with a Python application, the language of choice [02:34] at the time. You might use some kind of web framework, let's say Django. Dropbox doesn’t use Django, but let's just say something like that. You would define a bunch of endpoints, you write your implementations, and you would just ship the website with it. Dropbox started like that. It's definitely very, very similar to [Inaudible 02:54] company. And then we sort of grew that codebase. So we added a bunch more engineers, added a bunch more endpoints, and fast forward 10 years later, and you're at a place where we have a couple million lines of code in this monolith, we have thousands of machines running this same copy of code, we have like 1000-plus endpoints. And at that point, the age of this monolith has-- Throughout the entire history of Dropbox, the monolith sort of went through multiple transitions because the age started to show each time. And then back in 2019, it really just came to a head; we couldn't keep going the way we were anymore. Up to that point, also, there wasn't really a dedicated team that owned this service, and so it hadn't received real investment in a long time, [Inaudible 03:54]. And now that we were unraveling all these assumptions we had made about it before, we needed to put in the investment. And so I was asked if I wanted to come in and look at this and figure out what to do with it, and I was like, “Okay” - I had actually never looked at the product codebase at all. Prior to that, as you know, we were [Inaudible 04:17] actually parallel for a lot of years. So prior to that, I was pretty much working on the development side of things. I was doing a little bit of CI/CD. And it became a new challenge. Lots of interesting things that I hadn't thought about before, so I decided to take it and see how it goes. And so that's sort of how I came to start working on this monolith, this Python server. And fast forward two years later, and here we are. Utsav: So yeah, a lot of people think about monoliths versus microservices. If you had to start Dropbox [05:00] 10 years back, or if you had to start your own company today, with your experience, would you not start with a monolith? Or is it just that - at least my opinion is that - we just have to continuously invest in the monolith rather than getting rid of it? What are your thoughts? Naphat: [05:16] Yeah, I always believe that we should always do what's practical. And the monolith model served Dropbox really well for the past 10 years or something. It's really when Dropbox grew to 1000 engineers and 1000 commits per day, that's when it really started to break down. If you're at a startup and you ever get to that point, you're already very successful. So I will say do what’s practical.
If you start with microservices from the beginning, you're putting a lot of infrastructure investment in upfront, whereas you could be spending that time on things like actually getting your product off the ground and making sure that you have product-market fit. So in a way, I would say the journey Dropbox went through is probably going to be very typical for most startups; maybe some startups will invest in it sooner rather than later, they might do continuous investment over the lifecycle of the startup - it really depends on what the company's needs are and what the problems are. So there isn't a one-size-fits-all here. Once you get to a size like Dropbox, where you would probably have a team of tens to a hundred people working on the entire monolith, that's when it starts to make sense that, okay, splitting this apart into separate entities might make more sense: you're able to push independently, you're able to not be blocking other things. But there's a huge spectrum in between; it’s not one or the other. Yeah. Utsav: [06:51] So how do you tackle a problem as ambiguous as that? So you want to go from 99.9 to 99.95. And what that means, as you said, was your downtime per month can be no more than 20 minutes. How do you unpack that? And how do you start thinking about what you need to solve? Naphat: [07:12] Yeah, that's a good question. So we pretty much approached everything from first principles when we started looking into this project. So we put ourselves on-call, first of all, and then we started looking into the various issues that come up. We looked at historical data going back years - what were some of the common past outages, and what the themes were. And we also started talking to people who had actually been looking at this for a while, figuring out what their perspectives were, and trying to get the lay of the land for ourselves. Once we had all this information, we just started assembling what we believed the biggest hotspots were. We knew what the goal was. In our mind, the goal for 2019 was always clear: stabilize this as much as possible, get it to a point where we can easily buy ourselves another year, year and a half of not even looking at this, and then figure out what the long-term path is beyond the monolith. So with that goal in mind, in 2019, we had to stabilize it and get to the new SLA that we wanted. That allowed us to identify a bunch of problems. One of the problems that we encountered, for example, was the push itself. When we had to do an emergency push, the push itself could easily take 40 minutes or longer, just because of how unstructured the push was. I think there were like 10-plus clusters in the same deployment, just because no one had ever invested in it. And the way we were doing the push, it actually took about, I think, close to 20 minutes end to end just to do a rolling restart of the service. And nobody understood why that number was picked, and there was actually a lot of fear around changing the number. So that's one problem; we knew we had to go fix it. We identified a bunch of other problems, and again, the general mindset we had really was that we cannot be afraid of the monolith. If you’re going to be on-call for this, if you're going to be investing a lot of energy into this over the next year while we build the next thing, we need to get to a point where we know exactly what it’s going to do, and we know exactly where the limits are.
And so I think I spent a good one to two months just poking at the service in various ways: taking away boxes, trying to drag utilization up, and then figuring out how it fails and how I can recover from it, figuring out how long the push takes and what exactly goes into the startup sequence, figuring out how long the build takes, [10:00] figuring out what the other failure modes are, and figuring out how we prevent the failure modes to begin with. Once we had all the information, we just started getting to work. We knocked down the problems one after another and eventually got to a stable place. I think this is true with engineering problems in general. We just had to approach everything from first principles. We just had to make sure that we didn't have any biases going in, we didn't have any assumptions about how a problem should be solved, and just start. You know, you have to break some eggs to make an omelet, right? So I just started doing things, started poking at it, started seeing what it does. Figure out how to do it safely - first of all, figure out how you get a safe environment to do that in. For us, the way we did it was we redirected a small percentage of traffic to a small cluster, and then we only operated on that cluster. So even if that were to go down, it wouldn't be such a huge outage. And that allowed us to mimic what it would look like on a larger cluster. And that gave us a lot of confidence. So I don't know. What's your opinion on all this? I know you also came into this with a different mindset as well in 2020. Utsav: [11:24] No, I think that makes total sense. I think the idea of splitting off a little bit of traffic to another independent cluster where you can play around and actually get confidence makes a lot of sense. Because a lot of decisions when you're so large, and all of that context has been lost since the engineers working on that have moved on - a lot of decisions are fear-based decisions. They're not really rational, and then you kind of backfill your rationalization, like, “Oh, it's always been this way so it can probably never be better.” But you can always test things out. I think what's more interesting to me is when we, I guess, went against some industry-standard things, and I think for the better. One example is, in the industry, people talk about blue-green deployments, where you have another version of the service running in parallel and you slowly switch one by one. And that's basically not possible when you have like 2000 machines, or just a large cluster. And yeah, if you have a limit of being down for no more than 20 minutes, you can't wait 20 minutes for a new code push to go out, because a bad outage means you've basically blown your SLA, and you can't fix that. So I'm curious to learn how you thought about fixing that, and how you basically validated that approach? Because I know that the solution to this was basically pushing all of Dropbox in parallel, which sounds super scary to me. How did you gain the confidence to be okay with that? Naphat: [13:06] That's a good question. And let's talk about the push in two separate steps, or two separate categories of pushes, let’s say. Let's talk about emergency pushes, which is what we're talking about here - getting a hotfix out as quickly as possible. And this is why you had to mention our SLA. And then we'll get into later on how we do the regular push, and we'll talk about what the nuances are, what the differences between them are.
At the end of the day, again, it comes down to doing what’s practical. So what do we know? We know that we have a 40-minute - now 20-minute - downtime SLA. What that means is that, for the most part, you should probably be able to push in like five to 10 minutes, or 15 at the most. And so how do you do that against 2000 machines? That immediately rules out some kinds of pushes. Like you said, we are not going to be able to do any kind of meaningful Canary on all this; there's just not enough time. So what do we do? We started looking into what the build and push actually do and breaking down the current timings - which of them are things that we can never change, and which of them are things that we can just configure. So what goes into a push? When we try to ship a hotfix, we track everything from the beginning. We start from people writing code to actually fix it. Let's say that you had to do a one-line fix to something: before we started all this, it took about maybe 10 minutes [15:00] or 15 minutes to actually create the branch, get someone else to do a quick approval, actually commit it in, and then make a hotfix branch. So that's usually 10 to 15 minutes lost. We then have another build step that, prior to this, took about five to 10 minutes as well. So that's, let's say, 20 to 25 minutes lost at this point. And then the push itself, in one step, took about 20 minutes without any kind of Canary. So we're at about 45 minutes, without any kind of Canary, without any kind of safety. So this seems like an impossible problem. So let's break it down. Why does it take 10 to 15 minutes to create a commit? It really shouldn’t. The main problem here is that we were using the same workflow for creating a commit for a hotfix as we were for doing regular code review, and they're really, really different. If you're doing a hotfix, it should either be a really small change, or it should be something that you already committed somewhere else into the main branch, and you already know from some other validation that it probably works. Anything else that needs review, you're probably not going to figure out in 20 minutes anyway, so you're not going to go through the hotfix flow. Oh, and I forgot to mention that when something goes wrong, the first thing you should do is check whether you can roll back - that actually is a lot faster. We'll get into that in a bit, and how we sped it up as well. [16:17] So we wrote a quick tool, just a tool that takes a commit that’s already on the main branch and creates a hotfix branch with the right configuration, and then kicks off the push. This reduced the 10 to 15 minutes of initial time to about two minutes. So okay, making good progress here. We then started looking into the build. What's going on here? It turns out that because of how the Build team was structured at Dropbox, a lot of the investment that we made in build speed - and we've talked about this a lot externally: we're using Bazel, we actually cache our builds and all that - was limited to CI, and not to production builds. And this is not a hard problem; it's just that no one had ever thought to look into it, no one had ever spent the time. So coming from a CI team, I knew exactly where the knobs were.
And so I just talked to a few of my old teammates, including you, and we figured out how to actually speed up the production build, and we cut it down to about three to four minutes. So that's pretty good. So between this and creating a commit, we're at about five to six minutes. And that leaves us with another 14 minutes to do the push that we need to do. [17:42] And now let's talk about the actual push. We needed to get the push time down; twenty minutes was never going to work. So we needed to get it down to something manageable, something that we believed, in the event of a rollback, would give us more than enough time to make a decision. And so we sort of picked five minutes as our benchmark, as our target, and we wanted to get there. This just came from intuition: you have 20 minutes, and if you need to do a rollback, you kind of want some time to debug, and you want to have plenty of wiggle room in case something goes wrong. So let's say five minutes. We took a look at how the push was done, and really, the 20-minute push that was happening was completely artificial. There were a bunch of delays inserted into the push out of the fear that if you were to restart a server too quickly - because this is a very large deployment - other services that this server talks to might be overwhelmed when we re-create the connections. It was a valid fear, but it had never actually been tested. And there are actually ways that we have set up our infrastructure such that it wouldn't have this problem. For example, we actually deploy a sidecar with each of the servers, and that reduces the number of connections to upstream services by about 64x, because we run 64 processes per box. So we had things that we knew to be true that made us believe we should be able to handle much quicker restarts. [19:20] So how do we validate it? There really isn't a much better way to do this than to try it out, unfortunately. And so we made the small cluster that I was talking about earlier, because at the end of the day, what we care about is-- There are two things we have to validate. We have to validate that the server itself is okay when it restarts that quickly, and then we have to validate that the upstream services are okay. With the server itself, you don't need a large deployment. You can validate on a small deployment and eliminate one side of the problem. And then with the upstream services, we just had to go slow. [20:00] There isn't really another way to go about this. So we just had to go slow and monitor all the services as we went. So we went from 20 minutes to 18 to 16 to 14, and eventually to five. And we fixed a bunch of things along the way, of course, because issues did come up. And now we have a five-minute push. So if you look at it end to end now, we have about five to six minutes to create a commit, and then five minutes to do the push. That leaves us about nine minutes to actually do extra validation to make sure the push is actually safe. [20:35] And so we started thinking, “Okay, maybe we do a very informal Canary,” where if we were to do a hotfix, we probably know exactly what we're pushing; it’s probably only one commit, or rather, we [Inaudible 20:47] so there’s only one commit. And there’s probably only one very, very distinct change.
So what if we just very quickly push Canary within one or two minutes, because it's just a subset of the machines, and then just see whether those new boxes show the same errors as before - we have the metrics to tell us that. And this is very different from doing a blue-green deployment, where you would have to create two clusters of machines, one with the old code, one with the new code, and then try to compare metrics between them. This is just all eyeballing: looking at exactly the error that we know is causing the site to crash and seeing whether it's coming down. So we built that in. [21:37] We also built in another validation step for internal users, where we would push to a deployment that only served internal Dropbox users. And this is optional, depending on just how quickly you want to go. And then we codified all this into a pipeline that we can just click through with buttons in the UI. At the end of the day, we got that build and push down to about 15 to 18 minutes, depending on the time of day, or depending on how lucky you are with the build, and it worked really well. And then, at that point, it became a question of, “Okay, now that we have this build time and push time down, how do we actually keep this going? And how do we make sure that things don't regress?” Because a lot of the problems that we discovered are things that will regress if you don't keep an eye on them. The build cache could easily be broken again. So we established a monthly DRT where whoever was on-call was supposed to go and try doing the push themselves and see that it actually completes in time. And then we postmortemed every DRT to make sure that, “Okay, if it didn’t complete in time, why didn't it complete in time? If you do the breakdown, where was the increase? Is it in the build? Is it in the push?” and go from there. Utsav: [22:51] Yeah, and I think one of those things that stands out to listeners - like I know the answer, but I want to hear it again - is that if you push all of Dropbox at the same time, doesn't that mean you drop a lot of requests, like for those five minutes everything is down? And that's what people are worried about, right? Naphat: [23:11] Right. So we are not exactly pushing everything at the same time. You should think of a push as a capacity problem. How many machines can you afford to take away at a time while still serving the website? Whatever the number is, take that many and restart those machines as quickly as possible. So that's what we did. And so we set the threshold for ourselves at 60%: we never want utilization to go above 60%. And just to be safe, we only took 25% at a time and used that for pushing. And that allowed us to push everything in four batches. And that means that each batch needs to restart as quickly as possible. Each restart is isolated to one machine, so we stop traffic to that machine and just kick it as quickly as possible. It takes about one minute per batch, one and a half minutes per batch, depending on the batch. Utsav: [24:04] But then still, wouldn't you see like a 25% availability hit, if you're pushing 25% at the same time? Naphat: [24:11] No, because we stop traffic to the boxes first and reroute the traffic to the other boxes. So this is why we need to turn it into a capacity problem and make sure that, okay, if you know that at any given time the site is never more than 60% utilized, you can afford to take 25% away and still have 15% overhead. Utsav: [24:30] Yeah.
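A rough sketch of that batching logic, using the figures quoted above (60% peak utilization, 25% of boxes per batch) as inputs. The helper functions here are hypothetical stand-ins, not Dropbox code:

```python
import time

# Hypothetical stand-ins for internal machinery; the names are invented.
def stop_traffic(box: str) -> None:
    """Reroute this box's traffic to the rest of the fleet first."""
    print(f"{box}: draining traffic")

def restart_with_new_code(box: str) -> None:
    print(f"{box}: restarting")

def util_during_push(peak_util: float, batch_fraction: float) -> float:
    """If the site never exceeds `peak_util` of total fleet capacity,
    the surviving boxes run this hot while a batch is out."""
    return peak_util / (1.0 - batch_fraction)

def rolling_push(fleet: list[str], batch_fraction: float = 0.25) -> None:
    # 0.60 / (1 - 0.25) = 0.80: the remaining 75% of boxes run at ~80%,
    # so no requests are dropped even though 25% restart at once.
    assert util_during_push(0.60, batch_fraction) < 1.0
    batch_size = max(1, int(len(fleet) * batch_fraction))
    for i in range(0, len(fleet), batch_size):
        batch = fleet[i:i + batch_size]
        for box in batch:
            stop_traffic(box)
        time.sleep(5)  # illustrative drain pause, not a real tuned value
        for box in batch:
            restart_with_new_code(box)

rolling_push([f"web-{n:03d}" for n in range(8)])
```

The exact headroom left on the surviving boxes (~20% here) depends on how utilization is measured, which is roughly the "15% overhead" quoted above.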
How do you route traffic to the other boxes? What decides that? Because it seems like this is a complicated orchestration step? Naphat: [24:40] It actually isn't that complicated. The infrastructure at Dropbox is very well structured in this sense. So there is a global proxy - at the time, we actually wrote a few blog posts about it - there is a global proxy called Bandaid. Dropbox [25:00] was in the process of replacing it with Envoy right when you and I left. But there's a global proxy called Bandaid. This one keeps track of all the monolith boxes that are up, and when I say up, it means it passes health checks. So when we go and push a box, we make sure it fails its health check right away, the global proxy kicks it out of the serving pool within five seconds, we wait for requests to drain, and then we just kick it as quickly as possible. Utsav: [25:29] Okay, so it becomes this two-step dance, in the sense that the global proxy realizes that you shouldn't be sending traffic to these old boxes anymore, and it can basically reroute, and in that time you can get the new version out. Naphat: [25:43] Exactly Utsav: [25:44] Okay. Naphat: [25:45] Exactly, exactly. It's not at all a hard problem. It's just a matter of, “Okay, do we have the architecture to do this or not?” And we do, with the global proxy in place. Utsav: [26:05] Yeah, and with things like Envoy, I think you get all of this for free. You can configure it in a way that-- And I think by default, I guess once it realizes that things are failing health checks, it can kick them out of its pool. Naphat: [26:09] Right. Utsav: [26:10] Yeah. Naphat: [26:11] And this is a feature that you need anyway. You need a way to be able to dynamically change the serving set. Machines will go down. Sometimes you add emergency capacity. But I feel like you need the global proxy to have this particular feature, and it comes out of the box with Envoy, like you said. It's just a matter of how you actually send the proxy that information. At Dropbox, we did that via ZooKeeper. There are other solutions out there. Utsav: [26:37] And can you maybe talk about any other reliability problems? So this basically helps you reduce the push time significantly, but were there some interesting and really low-hanging fruits that you're comfortable sharing? Small things that nobody had ever looked at but ended up helping with a lot of problems. Naphat: [26:58] Yeah, let's see. It’s been so long, and I haven't actually thought about this in a while. But one thing comes to mind right away, which is not at all a small one, but I think it's still funny to talk about. So as part of figuring out the push, we needed to know: what is the capacity constraint on the server? As in, how much utilization can it take before it actually starts falling over? The intuitive answer is 100%, but servers are like things in life, never that clear or never that easy. And so, among the monolith on-call rotation, there was a common belief that we should never go above 50%. And this bugged the heck out of me when I joined, because it seems like we're leaving things unused. But we had seen empirically that when you get to about 60, 70% on some clusters, the utilization often jumps. It jumps from 70 to 100 right away, and it starts dropping requests. We had no idea why. And so that's interesting. How do you actually debug that? Very luckily for us, there were a lot of infrastructure investments that went into Dropbox that allowed this kind of debugging to be easier.
[28:22] So for example, just before I started working on this, Dropbox replaced its entire monitoring system with a new monitoring system that has 16-second granularity. That allowed us to get a different view into all these problems. And it turns out that what we thought was 70% utilization was actually something that spiked to 100% every minute, and then spiked back down to around 60%, and really averaged out to about 70. So that’s the problem. So that turns out [Inaudible 00:29:02] it’s just a matter of profiling. And it turns out that that's just because of the old monitoring system that we hadn't completely shut down yet: it would wake up every minute in a synchronized manner, because it used the same clock time - it would wake up on the minute, on the clock - and do a lot of work. And so the box would be entirely locked up. So one of the engineers on the team said, “Okay, you know what? Until we shut down the old monitoring system, what if we just make sure it's not synchronized, and we just add some jitter?” That allowed us to go to about 80, 90% utilization. That's pretty good. And that's why we kept it at about 60%. [29:46] So this is one of those things where we really just had to not be afraid of the things that we were monitoring. And the way we discovered all this is we just, again, created a small cluster and then started taking [30:00] boxes away to drive the utilization up, and we just observed the graph. And the first time I did this, I was with another engineer, and we were both basically freaking out with each extra box we were taking away, because you never know with this particular monolith how it's going to behave, and we still didn't fully understand the problem. And we actually caused maybe one outage, but it wasn't a huge outage. But it’s fine - this allowed us to actually track down the real cause, and then we actually fixed it. And of course, when we shut down the old monitoring system, this problem went away entirely. So yep, that's one thing. Then you might also ask why we were only at 90% utilization and why we couldn’t go to 100. That’s because of the quirks in how our load balancing works. Our server has a fixed number of worker processes, and so because of that, its capacity is fixed, unlike most companies. Most companies, when they get to 100%, they can serve a little bit more; everything just slows down. For us, at 100%, we just start dropping requests. And unless your load balancing is completely perfect and knows about every single machine, every single process, it’s not going to achieve [Inaudible 31:14] 100%, and our load balancing isn't perfect. Utsav: [31:19] That makes total sense. And I think that's why distributed systems engineers should remember the concept of jitter, because you never know when you'll need it, and for how long it will basically waste so much capacity for your company. That’s one expensive mistake. I remember seeing those graphs when we shipped that change. It was so gratifying to see. And also, I wished that we had added a little bit of jitter earlier and had never had to deal with this. Naphat: [31:48] This is the theme of what my 2019 looked like. At the end of the year, we actually sent out a giant email with all the fixes we did. I wish I still had it so I could actually read it to you. Let's say 60% of them might have been major fixes, but 40% of them were minor fixes like this, all one-liner fixes that we had just never invested enough in to actually go and look at.
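One of those one-liner fixes - the jitter trick from a few paragraphs up - is small enough to sketch in full. This is the general technique, not Dropbox's actual agent code:

```python
import random
import time

def run_periodic(task, period_secs: float = 60.0) -> None:
    # Without this sleep, every box that wakes "on the minute" fires at
    # the same instant and the whole fleet spikes together, exactly the
    # pattern described above. A one-time random phase spreads the work
    # across the period.
    time.sleep(random.uniform(0.0, period_secs))
    while True:
        started = time.monotonic()
        task()
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, period_secs - elapsed))

# Usage (hypothetical task name):
# run_periodic(collect_and_ship_metrics, period_secs=60.0)
```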
And so it's funny, but it saved us a lot of energy, it saved us a lot of time. By the end of the year, the on-call rotation was pretty healthy. You joined in 2020; you tell me how healthy it was. But I don't think there were that many hitches by the end of it. We could actually-- Utsav: [32:37] It was surreal to me that the team in charge of pretty much the most important part of the company, in a sense, had fewer alerts than the team that ran its own CI/CD system. I think that just made me realize that a lot of it is about thinking from first principles - fundamentally, it's just a service. And the amount of time you spend on on-call toil is inversely proportional to the amount of investment you put into quality and reliability. Things don't have to be that bad; it just takes a little bit of time to fix and investigate. Naphat: [33:31] Yep, for sure. And this is just the mindset that I want every infrastructure engineer to have. By the way, a lot of the things I talk about are things you would normally give to a team of SREs to solve. But of the people who worked on this with me, only one was a full-time SRE. And that's not saying anything against SREs; I'm just saying that when you're an infrastructure engineer, you need to be in a mindset where you don't divide the work completely between SRE and SWE. If you have the flexibility and you have the resources, sure. But when you don't, you need to be able to go in-depth. It goes in both directions: infrastructure SREs should be able to go and do some SWE work, and infrastructure SWEs should be able to go and investigate how servers react. And it will just allow us to build better infrastructure in turn. And so yeah, no, this is just-- Looking back, it's kind of funny; it seemed like such an insurmountable problem at the time. But at the end of the day, you know what? We just had to go fix it one by one. And now we just have good stories to tell. Utsav: [34:33] So then all of this begs the question: now that the monolith is generally reliable in 2020, and the goal for 99.95 is right there, why is the long-term decision to move away from the monolith, if it works? What is the reasoning there? Naphat: [34:55] That's a very good question. [35:00] So what you're talking about is our 2020 project. And it's called the Atlas Project. There's a blog post about it if you Google Dropbox Atlas. Utsav: [35:07] I can put it in the show notes for this podcast. Naphat: [35:11] Perfect, perfect. So it is basically a rewrite of our biggest service at Dropbox. Biggest stateless service, I’d say - Dropbox also runs Magic Pocket, which is way bigger than this, but that's a stateful service. So this is a rewrite of our biggest service at Dropbox. We undertook it for two main reasons. First is developer productivity. And what I mean by that is, at some point, the monolith really starts to restrict what you can do when you have 1000 engineers contributing to it - well, let’s say 500 to 600 product engineers contributing to it. What are some restrictions here? For example, you can never really know when your code is going to go out. And this is a huge problem when we do launches, because it could be that you're trying to launch a new feature at Dropbox, but someone made a change to the help page and messed that up. And we can’t launch something with a broken help page. And this kind of thing - yes, there should be tests, but not everything is tested.
It's just the reality of things in software engineering. So what do you do at that point? Well, then you have to roll everything back, fix the help page, then roll everything back out. And actually, we didn't talk about the actual push process, but we actually went to-- We have a semi blue-green deployment similar to that. It takes about two to three hours to run. This is the thing that we use to push out thousands of commits, but we have to be more careful because we don’t actually know what we are pushing. So if you had to restart the entire process, you set back any launch you have by two to three hours. So this was a huge problem at Dropbox. There were other problems, of course. You never quite know if someone else is going to be doing something to the in-memory state that will corrupt your endpoint. You are not allowed to push any more frequently than whatever cadence the central team gives you, which happens to be daily. When you're doing local development, you can never really start just the specific slice that you care about; you have to start the entire thing, which takes a long time to build. So the monolith itself really restricts what you can do and how productive you can be when you get to a certain size. And again, this is a huge size; I'm not saying every company should do this. We got to a few million lines of code, 500 to 600 engineers contributing to this, and 1000-plus endpoints, so this is an extreme scale. On the other note, the other reason we embarked on this project was that this particular server was built at the beginning of Dropbox. It used the latest and greatest technology at the time, but it really hadn't caught up to the rest of the company. For example, the server that we were using - there wasn't any support for HTTP/2.0. And so we, at some point, had a bug where-- It didn't have support for HTTP/1.1 and 2.0; it only supported 1.0. Our global proxy, the Bandaid proxy that we talked about earlier, and also Envoy, these proxies only support HTTP/1.1 and above. And so we had these two things talking using incompatible protocols for the longest time. For the most part it's fine, except for some parts where it's not. And so we actually had a bug for like an entire week where, in some cases, we would return a success, 200, with an empty body, just because of the [Inaudible 39:07] incompatible protocol. We could probably have upgraded this Python server, but doing so requires a significant amount of work. And if you're already going to go through that amount of work, maybe we should also think about, “Okay, what else do we want to change here? What else can we do to actually move this in the direction that we all want it to move in?” And just for context, in all this time that the server existed at Dropbox, every other service at Dropbox had already moved to gRPC. So gRPC was a very well-supported protocol at Dropbox. It was very well tested. There was a team that actually upgraded it regularly and ran load tests on it and all that, but this thing just hadn't kept up. So we needed to find a way to get to something else that is equally well supported, which meant getting to gRPC itself. [40:00] So enter Atlas. We decided, “Okay, time to invest in this for real. We bought ourselves time in 2019.
Let’s now go do some engineering work, let's figure out how we build something that we're going to be proud of.” And I take this very seriously, because back when I was still doing interviews, when I was still based in the US, every time I had to talk about the Dropbox codebase, I sort of talked around the complexity of the monolith, because we all know just how bad it was. And when I was actually talking about it, I just had to say, “Yes, it's bad, but also look at all these other things.” I don't want to keep saying that. I just want to say, “You know what? We have a great platform for every engineer to work in, and we should all be proud to work here.” That matters a lot to morale at a company. Utsav: [40:51] Yeah. Naphat: [40:53] And so we embarked on the project. We first started building a team; the team had to be almost completely rebuilt. You joined - that made my day. That made my quarter, let's say. And then we started putting together a plan. It was a collaboration between us and another infrastructure team at Dropbox. We had to do a bunch of research, put together a plan for, “Okay, what do we want the serving stack to look like? What do we want the experience to look like?” We should also get into what our plan actually was and what [Inaudible 41:29] were shipping. Do you have anything that you want to add to the backstory here before we move on? Utsav: [41:35] No, I want to actually ask a few more probing questions. So if you look at other big companies, Slack and Shopify in particular, first of all, their codebases aren't as old as ours, so I doubt that they’re running into the kind of random bugs that we were running into, but they're still pretty big. And they seem to work around the problem by pushing 12 times a day, 14 times a day - they just push very frequently. And that also gets developers’ code out faster, so that solves one part of the developer productivity problem. I guess, why did we not pursue that? And why did we instead decide to give people the opportunity to push on their own, in a sense, or have one part of the monolith’s code being buggy not block the other part, rather than just pushing really frequently? Naphat: [42:30] Right. I would have loved to get to that model. And that was actually the vision that we were selling as we were selling Atlas to the rest of the company. It's just that, the way the monolith was structured at Dropbox, it wasn't possible to push that many times a day. It wasn't possible to push just a component; you had to push everything. So about pushing many times a day, for example: we couldn't get the push to be reliably automated, just because of how many things we’re pushing and how many endpoints we’re pushing. So we built this blue-green deployment - I'm going to call it Canary Analysis, because that's what we called it at Dropbox. The way it works is that we have three clusters for the monolith. We have the Canary cluster, which receives the newest code first, and we have the control cluster, which receives the same amount of traffic as Canary but stays on the same version of code as prod. So it's the same amount of traffic, same traffic pattern, and all that. During a push, we push Canary and we kick control, so they have the same life cycle. If the Canary code is no worse than control, then all the metrics should look no worse - CPU should look no worse, memory should look no worse, everything like [Inaudible 43:44] and all that.
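In code, that canary-vs-control comparison might look something like this toy version (the metric names and the 10% tolerance are invented for illustration; the real analysis ran over many more signals for about an hour):

```python
TOLERANCE = 1.10  # canary may be at most 10% worse than control (made up)

def canary_ok(canary: dict[str, float], control: dict[str, float]) -> bool:
    """Block the push if any canary metric regresses past the tolerance."""
    for metric, control_value in control.items():
        if canary[metric] > control_value * TOLERANCE:
            print(f"blocking push: {metric} regressed "
                  f"({canary[metric]:.3f} vs {control_value:.3f})")
            return False
    return True

control = {"cpu_util": 0.55, "p95_latency_ms": 180.0, "error_rate": 0.002}
canary  = {"cpu_util": 0.54, "p95_latency_ms": 240.0, "error_rate": 0.002}
print(canary_ok(canary, control))  # False: p95 latency regressed
```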
We then wrote a script that goes and looks at all these metrics after an hour and just decides: can you actually proceed forward with the push or not? It turned out that, let's say, more than half the time the push wasn't succeeding, and so Dropbox actually had a full-time rotation for pushing this monolith. If we were already running into problems every other time, there was just no feasible way we were going to get to aggressive pushing - not even four times a day, let's say three times a day. We weren’t going to get that. It's not going to fit within the business day. Keep in mind also that the Canary Analysis itself takes about two to three hours. And the reason it took that long is, A, we wanted the metrics, but also, we made each push itself very slow, just so that if there's a problem, we catch it in time and roll back. So because of how it was built, it just wasn't feasible. We would actually have to figure out how to work around all these problems, and Atlas, the project that we were building towards, would solve the problem, or would at least give us the foundation to [45:00] go forward with that eventually. I remember this because this is one of the first questions you asked me when you joined the team. And I completely agree, and I really hope that Dropbox will get there. I left before the project was completely rolled out, but having talked to people who are still there, they are very much on target. And going forward, they can easily then start to invest and say, “Okay, you know what? Now that the push is reliable, let's push multiple times a day.” Utsav: [45:29] Yeah, I think philosophically, the reason why people like splitting their stuff up into multiple services is basically separation of concerns and all of that. But also, you get to own your push speed, in a sense. You get to push without being blocked on other people. And by breaking up the monolith into chunks, and if not letting people push on their own, at least not blocking their push on somebody else's bad code, I think that basically breaks down the problem and makes things much more sustainable for a longer period of time. Because the monolith is only going to get bigger; we're only going to get more engineers working on more products and more features and everything. So that's why I feel like it's actually a very interesting direction that we went in, and I think it's the right one. And yeah, we can always do both in parallel as well. We can also push each part of the monolith like 12 times a day and make sure that that doesn't get blocked. Naphat: [46:31] Yeah, but just to make sure that we don't completely give the win to microservices here: there are real problems with microservices. And Dropbox tried really hard to move to the services model a few years earlier, and we couldn't quite get there. There are real problems. Running a service is not easy. And the skillset that you need to run a service is not the same skillset that you need to write a good product or a good feature on the website. So we're asking our engineers to spend time doing things that they're not comfortable doing, and we're asking them to be on-call for something that they are not completely comfortable operating. And from a reliability standpoint, we are now tasked with setting the standard across multiple services instead of just dealing with one team. There are real problems with microservices, and so this is why we didn't-- [47:25] Let's talk about Atlas a little bit. So Atlas, the next generation of Dropbox’s monolith - we sort of tried to take a hybrid approach. First, I said microservices have real problems, but our monolith is not going to scale. So what do we do? We went with the idea of a managed service. And what that means is, we keep the experience of the monolith, meaning that you still get someone else to be on-call for you - we get our team, our previous team, to be on-call - and you get someone else pushing for you; in any case, it's going to be automated. It's going to be a regular push cadence: you just have to commit before a certain cutoff time, and it will show up in production the next day. You're not responsible for any of that. But behind the scenes, we are carving everything out into services, and you get most of the benefits of a microservice architecture, but you still get the experience of a monolith. There are some nuances - we do have to put some guardrails in: “Okay, this only works for stateless services; you're not allowed to open a background thread that goes and does stuff,” because that's going to be very hard for us to manage. You need to use the framework that we are going to be enforcing, gRPC, and you need to make sure that your service is well behaved. It cannot just return errors all the time; otherwise, they're going to kick you out, or they're going to yell at your team very loudly, or very politely. So there are rules you have to follow, but if you follow the rules, you're going to get the experience, and it’s going to be easy. So that’s the whole idea with Atlas. And I'm probably not doing it justice - which is probably a good thing for Dropbox - but it's really interesting that we got there. With Atlas, it seems like the right balance. It is a huge investment, so it's not going to be right for every company. It took us a year to build it, and it's probably going to be another six months to finish the migration. But it's very interesting, and I think it is the right direction for Dropbox.
So Atlas, the next generation of Dropbox’s monolith, we sort of tried to take the hybrid approach. First, I said microservices have real problems, but our monolith is not going to scale. So what do we do? So we went with the idea of a managed service. And what that means is, we keep the experience of the monolith, meaning that you still get someone else to be on-call for you. We get our team, our previous team to be on-call, you get someone else pushing for you, in any case, it's going to be automated. It's going to be a regular push cadence, you just have to commit a quarter section-time, and it will show up production the next day. You're not responsible for any of that. But behind the scenes, we are charting everything out into services, and you get most of the benefits of a microservice architecture, but you still get the experience of a monolith. There are some nuances like we do have to put some guardrails into, “Okay, this only works for stateless services, you're not allowed to open a background thread that goes and stuff,” that's going to be very hard for us to manage. You need to use the framework that we are going to be enforcing, gRPC, you need to make sure that your service is well behaved, it cannot just return errors all the time, otherwise, they're going to kick you out, or they're going to yell at your team very loudly, or very politely. So there are rules you have to follow, but if you follow the rules, you're going to get the experience, and it’s going to be easy. So that’s the whole idea with Atlas. And I'm probably not doing it justice, which is probably a good thing to Dropbox but it's really interesting that we got there. With Atlas, it seems like the right balance. It is a huge investment, so it's not going to be right for every company. It took us a year to build it and probably going to be another six months to finish the migration. But it's very interesting, I think it is the right direction for Dropbox. Utsav: [49:36] Yeah. I think the analogy I like to give is it's like building a Heroku or an App Engine for your internal developers. But rebuilding an App Engine or Heroku, what's the point? You could just use one of those. The idea is that you give them a bunch of features for free. So checking if their service on Heroku is reliable, we do that. We [50:00] automatically track the right metrics and make sure that your route certainly doesn't have only 50% availability. We basically make sure of that. Making sure that your service gets pushed on time, that that push is reliable, making sure for basic issues, operations just happen automatically. So we can even automatically Canary your service. So we push only 10% and we see if that has a problem. All of these infrastructural concerns that people have to think about when developing their own services, we manage all that. And in return, we just ask you to use standard frameworks like gRPC. And the way we can do this behind the scenes is if you're using a gRPC service, we know what kind of metrics we're going to get from each method and everything. And we can use that to automatically do things like Canary Analysis. I think it's a really innovative way to think about things. Because I think from user research, we basically found this very obvious conclusion in retrospect - product engineers, they don't care that much about building out interesting infrastructure. They just want to ship features. 
So if you think about it from that mindset, everything that you can do to abstract away infrastructural work is okay with people, and in fact, they prefer it that way. Naphat: [51:29] For sure. And this isn't specific to product engineering; every engineer thinks this way. We have a goal, we want to get to it, and everything else is in the way. How do we minimize the pain? And I really liked the way you phrased it, that we sort of provide a lot of things out of the box. I think one of the directors at Dropbox used the analogy that Atlas would be a server with batteries included in the box. You don't have to think about anything; it just works. It sounds innovative, and it probably is, but you have to give credit where it’s due: a lot of this tooling already exists at Dropbox. We’re just packaging it together, and we're saying, “Okay, product engineers at Dropbox, here's the interface. Write your code here and you get all of this.” We’re just packaging it - you get automatic profiling, you get automatic monitoring, you get automatic alerting. It really is a pretty good experience; I kind of miss it, now that I'm working at a startup. Utsav: [52:28] And I think the best thing is you get an automatic dashboard for your service. You build something out and you get a dashboard of every single metric that's relevant to you, at least at a very high level. You get all of the relevant exceptions. You also get auto-scaling for free; you don't have to tweak any buttons or any configuration to do that. We automatically manage that. And that's the reason why we enforce constraints like statelessness, because if you have a stateful service, auto-scaling and everything is a little funky. So for most use cases, an experience like that makes sense. And I think it really has to do with the shape of the company. There are some companies that have a few endpoints and a lot of stuff going on behind the scenes. But with the shape of Dropbox, you have basically 1000-plus endpoints, each doing slightly different business logic, and then there's the core file system and sync flow. So for all of the users and all of the engineers working on these different endpoints, something like Atlas just makes sense for them. Naphat: [53:40] Right. And there's a real question here about whether we should have fixed this a different way. Like, for example, should we have changed how we provide endpoints at Dropbox? Could we have benefited from a structure like GraphQL, for example, and not have to worry so much about the backend? It's a real question. Realistically, at a company this big, with a lot of existing legacy code, we had to make a choice that is practical. And I keep coming back to this: we had to make something that we knew we could migrate the rest of the things to, but that still reasonably serves the needs we have. And this was actually one of the requirements for shipping Atlas: we're not going to build and maintain two systems. The old system has to die. And that constrained a lot of how we were thinking about it - some in good ways, and some not in good ways - but that is just the way things are. Someone I respect very deeply said this to me recently: “Don't try to boil the ocean.” We just try to take things one step at a time and move things in the right direction, and make sure that, as long as we can articulate what we want this to look like in two to three years, [55:00] and we are okay with that, that's probably a good enough direction.
Utsav: [55:03] I think, yeah, often what you see in companies, especially larger ones, when you have to do migrations, is people keep the old system around for a while because it's not possible to migrate everything. But also, you impose a lot of cost on the teams when you're doing that migration - you make them do a lot of work to fit the new interface. Now, the interesting thing about the Atlas project, which we've written about in the blog post, is the fact that it was meant to be a zero-cost migration: engineers don't have to do some amount of work per existing piece of functionality in order to get stuff done. Of course, it wasn't perfect, but that was the plan - that we could automatically migrate everyone. I think it's a great constraint. I love the fact that we did that, but why do you think that was so important for us to do? Naphat: [55:56] Yeah, I think we have been burned so many times at Dropbox by tech debt and incomplete migrations, and so we benefited from all these past experiences. We knew that, “Okay, if we're going to do this project correctly, we need to complete the migration. And so we need to design a system that will allow us to easily complete the migration, and we need to assemble a team that will help us complete the migration.” The team that we put together is probably responsible for a good majority of the migrations at Dropbox in the past, and so each person brings those experiences. I've seen one teammate, for example, write some code that parses the Python abstract syntax tree and then just changes some of the APIs around for our migration. I was in awe of the solution. I didn't think it would be that easy, but he did it in a 50-line Python file. So I think it's a very ambitious goal to have, and you need to commit to it in the beginning - you need to know that you're going to be designing for this, so that you do everything the right way to accommodate it. But now that you have it, now that you're doing it, by the end of it you get to reap all the benefits. We actually get to go and kill this legacy server, for example, and delete it from our codebase. We get to assume that everything at Dropbox is gRPC - that is a huge, huge thing to assume. We get to assume that everything at Dropbox will emit the same metrics and behave the same way, and all that. I think it's very satisfying to look back on. I am really glad that we took this time. And we’re patting ourselves on the back a lot, but it's not like this project went completely smoothly. There were real problems that came up, and so we had to go back to the drawing board a few times; we had to make sure that we actually satisfied a real customer need. And by customers, I mean Dropbox engineers - they were our real customers while we were building this platform. But I think we got there in the end. It took a few iterations, but we got there.
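The ~50-line AST migration script itself isn't public, but the general technique is standard. A minimal sketch using Python's stdlib ast module, with invented old/new API names:

```python
# A minimal example of AST-based code migration: parse source, rewrite
# call sites mechanically, and emit code back out. The old and new API
# names below are invented for illustration.
import ast

class ApiRenamer(ast.NodeTransformer):
    RENAMES = {"legacy_get_user": "atlas_get_user"}  # hypothetical names

    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id in self.RENAMES:
            node.func = ast.copy_location(
                ast.Name(id=self.RENAMES[node.func.id], ctx=ast.Load()),
                node.func,
            )
        return node

source = "user = legacy_get_user(user_id)\nprint(user)"
tree = ApiRenamer().visit(ast.parse(source))
print(ast.unparse(ast.fix_missing_locations(tree)))
# user = atlas_get_user(user_id)
# print(user)
```

A production migration tool would more likely use a concrete-syntax-tree library so comments and formatting survive the rewrite, but the shape is the same.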
Utsav: [58:32] Yeah, and maybe just to close this up, do you have any final thoughts or takeaways for listeners who are thinking about infrastructure and developer productivity in general? The interesting part, organizationally, to me was that our team was not only responsible for keeping the monolith up - its reliability - but also for the productivity of engineers. And maybe you can talk about why that is so important? Because if you focus only on one side, you might end up over-correcting. Naphat: [59:05] For sure. I think one thing I want to focus on is how important it is to have empathy. You really need to be able to put yourself in your customers’ shoes, and if you can't, you need to figure out how to get there, and figure out, “Okay, what exactly do they care about? And what problems are they facing? And are you solving the right problems for them?” You're going to have your own goals - you want to create some infrastructure. Can you marry the two goals? And where you can't, how do you make a trade-off? And how do you explain it? It's going to go a long way in a migration like this, because you’re going to need to be able to explain, “Okay, why are you changing my workflow? Why are you making me write [Inaudible 59:44]?” You can then tell a whole story about, “Okay, this is how your life is going to get better - there are going to be faster pushes, you're going to get these faster workflows. But first, you have to do this.” And that's the story you tell. That's a lot better, but it requires [01:00:00] trust to get there. So I would say that's probably the most important thing to keep in mind when building infrastructure, or really building anything: make sure you know who your customers are, and make sure you have an open line of communication to them. There were a few product engineers at Dropbox that I checked in with weekly, just to get their opinions and bring them into the early stages [Inaudible 01:00:25]. And it was great. They sort of became champions of the project themselves and started championing it to their own teams. It’s an amazing thing to watch, to be a part of. But it really comes down to that: just have empathy for your customers, figure out what they want. And you really can only do that if you own both sides of the equation. If I only own reliability, I have no incentive to go talk with customers. If I only own workflows, I have no leverage to pull on reliability. So you really need to own both sides of the equation. Once the project matures and you don't really need to invest in it anymore, you can talk about having a different structure for the team. But given how quickly we were moving, there was really only one way this could have gone. Utsav: [01:01:16] Sounds good. And yeah, I think this has been a lot of fun, reliving our previous work. And it was really exciting to be part of that project, to see how it went and what the impact was. And hopefully, in our careers, we can keep doing impactful projects like that, is the way I think about it. Naphat: [01:01:39] That is the hope, but really, just getting to do this is a privilege in its own right. If I never get to do another large project, I mean, I'll be a bit sad, but I also won't be completely unsatisfied. Yeah, I'm really glad that we got to have this experience. And I'm really glad for all the things we got to solve along the way, all the people we got to meet, all the relationships we built, and now all the memories we have. And it just really energizes us for the next thing, I think. Utsav: [01:02:15] Yeah. And yeah, thank you for being a guest. I think this was a lot of fun. Naphat: [01:02:20] Of course, of course. This was a lot of fun. We should do this again sometime. Utsav: [01:02:26] We have a lot more stories like this. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Christina Cacioppo and Robbie Ostrow work at Vanta , an automated security and compliance company with a mission to secure the internet. Vanta sets up monitoring via a set of continuous tests to ensure basic security best practices, like mandatory MFA for employees. Each test bubbles up to one or more compliance standards like SOC-2 so that companies can rapidly move their audits and unlock deals. Apple Podcasts | Spotify | Google Podcasts This episode is special because of two reasons: I currently work at Vanta, and it’s the first combined interview with both the CEO and the first engineer at the company, which led to an interesting conversation with multiple perspectives. As usual, the episode focuses on the technology and business of Vanta, and I’ve tried to not go easy on them, even though there’s an obvious bias involved :) Highlights My notes are italicized 2:00: “In order to work on a security company, you’d actually best start with compliance company” - compliance is a “hair-on-fire” problem for companies since it helps unlock deals, whereas security is often an afterthought. Solving compliance helps make companies safer since the incentives align better. This idea and the headache of SOX compliance at my previous job convinced me to work at Vanta. 5:00 - Continuous security monitoring vs. snapshots that are double-checked in audits 11:00 - How Vanta was initially built. 17:00 - Should security reports be standardized or extremely customizable per company? 20:00 - How does someone decide on the set of security policies? Do customers ask for advice? 31:00 - How should engineers think of developer productivity for their startups? What has the impact of initial choices like MongoDB and GraphQL been as the company has grown? 40:00 - At what point should a founder decide to hire an engineer? What qualities should the engineer have? At what point should the founder stop interviewing engineering candidates? 52:00 - How to effectively build a brand for a security company? Experiences over the past few years. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev…
 
Alexander Gallego is the founder and CEO of Vectorized. Vectorized offers a product called Redpanda, an Apache Kafka-compatible event streaming platform that's significantly faster and easier to operate than Kafka. We talk about the increasing ubiquity of streaming platforms, what they're used for, why Kafka is slow, and how to safely and effectively build a replacement. Previously, Alex was a Principal Software Engineer at Akamai and the creator of the Concord Framework, a distributed stream processing engine built in C++ on top of Apache Mesos.

Apple Podcasts | Spotify | Google Podcasts

Highlights

7:00 - Who uses streaming platforms, and why? Why would someone use Kafka?

12:30 - What would be the reason to use Kafka over Amazon SQS or Google Pub/Sub?

17:00 - What makes Kafka slow? The story behind Redpanda. We talk about memory efficiency in Redpanda, which is better optimized for machines with more cores.

34:00 - Other optimizations in Redpanda

39:00 - WASM programming within the streaming engine, almost as if Kafka were an AWS Lambda processor

43:00 - How to convince potential customers to switch from Kafka to Redpanda?

48:00 - What is the release process for Redpanda? How do they ensure that a new version isn't broken?

52:00 - What have we learned about the state of Kafka and the use of streaming tools?

Transcript

Utsav [00:00]: Welcome, Alex, to an episode of the Software at Scale podcast. Alex is the CEO and co-founder of Vectorized, a company that provides a product called Redpanda, and correct me if I'm wrong, but Redpanda is basically a Kafka replacement that's a hundred times faster, or at least significantly faster, than Kafka itself. I understand the motivation behind building Redpanda, but I'm super fascinated to learn what got you into it and what you learned in the process. Thank you for being here.

Alexander "Alex" Gallego: Yeah, thanks for having me; a pleasure being here. This is always so fun, getting a chance to talk about the mechanical implementation. The company is pretty large now, so these days I get to do a little bit of code review of the business, and it's always fun to get into the details. I've been in data streaming for a really long time, 12 years now, and the way I got into it was through a startup in New York. I was doing my PhD in crypto; I dropped out and went to work for this guy at a startup called Yieldmo. The name doesn't mean anything, but it was an ad tech company that competed against Google in the mobile market, and that's really how I got introduced to streaming. The name of the game back then was to use Kafka and Apache Storm and ZooKeeper, and honestly, it was really hard to debug. I think I experienced the entire life cycle of Kafka, from maybe the 0.7 or 0.8 release back in 2011 or something like that, all the way until my previous job at Akamai, where I was a principal engineer measuring latency and throughput. So I've watched Kafka evolve from the early days through the current work to remove the ZooKeeper requirement and all of these things. I guess the history of how we got here is: at first we were optimizing ads with Storm, and then Mesos started to come about, and I was like, oh, Mesos is really cool. The future of streaming, it's going to be: you're going to have an oracle, and then something that's going to schedule the containers.
In retrospect, now that Apache Mesos got archived, pushed to the Apache Attic a couple of weeks ago, we chose the wrong technology. I think it was the right choice at the time: Kubernetes was barely working on a couple of hundred nodes, and Mesos was proven at scale, so it just seemed like the right choice. What we focused on, though, is streaming. Streaming is a technology that helps you extract the value of now, or in general deal with time-sensitive data. A trade, fraud detection, Uber Eats: these are all really good examples of things that need to happen relatively quickly, in the now, and streaming systems are the kind of technology designed to help you deal with that complexity. So what Concord did was say, hey, we really liked the Apache Storm ideas. Back then, Storm was really slow and really hard to debug: this thing called Nimbus, and the supervisors, and stack traces across multiple languages [Inaudible 3:27], Clojure and Java. I was like, I need to learn three standard libraries just to debug this thing. And so we wrote Concord in C++ on top of Mesos, squarely on the compute side. Streaming is really where storage and compute come together, and at the end of it you do something useful; you say, hey, this credit card transaction was fraudulent. That's what I did for a long time. And, long story short, I'd been using Kafka really as a storage medium, and I personally couldn't get enough performance out of it with the right safety mechanics. So in 2017, I did an experiment where I took two optimized edge computers and literally wired them back to back. No rack latency, nothing; just a cable connected back to back between these two computers. And I measured it: let me start up a Concord server, I think maybe version 2.4, or 2.1 at the time, a Concord server and a Concord client, and measure what they can drive this hardware to for 10 minutes, both in latency and throughput. Then let me tear that down and write a C++ program that bypasses the kernel, and bypasses the page cache at the storage level too, and see what the hardware is actually capable of. I just wanted to understand what the gap is. Where does this accidental complexity come from? How much performance are we leaving on the table? [5:00] The first implementation was a 34x tail-latency improvement, and I was just floored. I took two weeks comparing the bytes of the data, just making sure that the experiment had actually worked. And so, honestly, that was the experiment that got me thinking, for a long time, that hardware is fundamentally different from how hardware was a decade or more ago, when Kafka and Pulsar and all these other streaming technologies were invented. If you look at it, the Linux scheduler's block and IO algorithms, basically the thing where you send data to a file and Linux organizes it for optimal writes and reads, are fundamentally different now: they were designed for millisecond-level latencies, and the new disks are designed for microsecond-level latencies. That is a huge gap. Not to mention that now you can rent 220 cores on a Google VM; you can rent a terabyte of RAM. So the question is, what could you do differently with this new [inaudible 6:11]? It's so different. It's like a totally different computing paradigm.
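For a flavor of the measurement Alex describes, here is a minimal sketch in Python of timing writes with the page cache bypassed. It assumes a Linux machine (O_DIRECT is Linux-specific) and a throwaway file path; it illustrates the technique, not the actual C++ harness he wrote.

    import mmap, os, time

    PATH = "/tmp/odirect_probe"  # throwaway test file
    BLOCK = 4096                 # O_DIRECT needs block-aligned sizes and buffers

    # mmap(-1, n) returns a page-aligned anonymous buffer, satisfying
    # O_DIRECT's alignment requirement.
    buf = mmap.mmap(-1, BLOCK)
    buf.write(b"x" * BLOCK)

    # O_DIRECT bypasses the page cache; O_DSYNC makes each write durable.
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DIRECT | os.O_DSYNC, 0o644)
    samples = []
    for _ in range(1000):
        t0 = time.perf_counter_ns()
        os.pwrite(fd, buf, 0)
        samples.append(time.perf_counter_ns() - t0)
    os.close(fd)

    samples.sort()
    print("p50   (us):", samples[len(samples) // 2] / 1000)
    print("p99.9 (us):", samples[int(len(samples) * 0.999)] / 1000)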
And I know that [Inaudible 6:19] has coined the term "there's no free lunch": you basically have to architect something from scratch for the new bottleneck, and the new bottleneck is the CPU. The [Inaudible 6:30] latencies are so good, and the same goes for the network and all these other peripheral devices, that the new bottleneck is actually the coordination of work across a 220-core machine. The future is not cores getting faster; the future is getting more and more cores. And so the bottlenecks are in the coordination of work across the CPUs. And so we rewrote this thing in C++, and that's maybe a really long-winded way of saying how we got here.

Utsav: That is fascinating. So maybe you can explain a little bit about where someone would deploy Kafka. You mentioned fraud, and that makes sense to me: you have a lot of data and you need to stream it. But the example that was a little surprising was Uber Eats. Where would Kafka fit in the Uber Eats pipeline?

Alex: Great question. Let me give you a general sense and then we can talk about that case. Event streaming has been a thing for a really long time; people have been building trading systems forever, but in the modern stack it's called event streaming. And what is an event? An event is what happens when you contextualize data. So you have some data: say, a yellow t-shirt, or the green sweater I'm wearing today. That's just data; it doesn't mean anything. But now if I say I bought this green sweater with my Visa credit card, and I bought it from, let's say, a Korean seller through www.amazon.com, then once I add all of that context, that makes an event. There's a lot of richness there. Implicitly, there's also a lot of time attached to that transaction: if I buy it today, there's an immutability to these facts. And so event streaming is this new way of thinking about your architecture, as immutable, contextualized events flowing through it. In the case of Uber Eats, for example: I go into my Uber Eats app, select my favorite Thai restaurant, and say, hey, get me number 47. I can't pronounce it, but I know it's the item I always get from the restaurant around the corner; it's like Chinese broccoli. It's immutable that I paid $10 for it. It's immutable that the restaurant got the order 30 seconds later. And so you start to produce this chain of events, and you can reason about your business logic as, effectively, function application over a series of immutable events. It's like functional programming at the architectural level. And why is that powerful? It's powerful because you can now actually understand how you make decisions. To go back to fraud detection: the streaming system is really useful not in making the decision. You can just write a little microservice in Node or Python, it doesn't matter, that says, hey, if the credit card was charged $10,000, that's probably fraudulent for Thai food. That's not the interesting part. The interesting part is that everything has been recorded in an ordered fashion, so you can always make sense of that data and you can always retrieve it. [10:01] So there are these properties that Kafka brought to the architecture. One, durability: this data actually lives on disk. Two, it's highly available.
If one computer crashes, the data lives on the other two computers, so you can always get it back. And three, it's replayable: if the compute crashes, you can resume from the previous iterator. I think those were the properties that enterprise architects started to understand: oh, it's not just for finance and trading; it works across almost any industry. Today we have customers, even though we're a relatively young company, in oil and gas, measuring the jitter between oil and gas pipelines, where you have these little Raspberry Pi-looking things, and the point of the Kafka pipeline, later replaced with Redpanda, was just to measure how much jitter there is on the pipeline and whether to turn it off. We've seen it in healthcare, where people are analyzing patient records and patient data, and they want to connect new technologies like Spark ML or TensorFlow to real-time streams. For COVID, for example, we were talking with a hospital in Texas that wanted to measure their COVID vaccines in real time and send alerts to suppliers. We've seen people in the food industry, and in sports betting, which is huge outside of the United States. To me, it feels like streaming is at the stage where databases were in the seventies. Before databases, people wrote to flat files, and that was the database: every customer gets a flat file, you read it, and every time you need to change it, you rewrite the entire customer's data. That's a kind of pseudo-database, but then databases gave users a higher level of abstraction and a modeling technique. And to some extent, that's what Kafka has done for the developer: a new way to model your infrastructure as an immutable sequence of events that you can reproduce and consume, and that is highly available. So I think those were the benefits of switching to an architecture like this.

Utsav: Those customers are super interesting to hear about, and IoT makes sense. So maybe one more question that would come to a lot of people's minds: at what point should you stop using something like SQS? It seems to provide a lot of similar functionality. It's more expensive and you don't get to touch the bare-bones stuff, but Amazon supports it for you. So why do customers move off SQS or something like that and start using Kafka or Redpanda directly?

Alex: Yeah. So here's the history. Kafka is 11 years old now, and the value to developers is not Kafka the system, but the millions of lines of code they didn't have to write to connect to other downstream systems. The value is in the ecosystem; the value is not in the system. So when you think about it, think of Kafka as two parts: Kafka the API and Kafka the system. People have a really challenging time operating Kafka the system, with ZooKeeper. And I know there might be some listeners who are thrilled, like, oh, KIP-500 was released; we can talk about what KRaft means and ZooKeeper's removal later. But anyway, if you look at Kafka, it's two things: the API and the system.
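To make the "Kafka the API" point concrete: a producer written against any Kafka-compatible client library talks to Kafka or Redpanda without code changes. A minimal sketch with the confluent-kafka Python client, where the broker address, topic name, and event fields are placeholders:

    import json
    from confluent_kafka import Producer

    # Point bootstrap.servers at Kafka or Redpanda; the application code is
    # identical because both speak the same wire protocol.
    producer = Producer({"bootstrap.servers": "localhost:9092"})

    # An "event": data plus context (who, what, when), treated as immutable.
    event = {
        "order_id": 47,
        "item": "chinese broccoli",
        "amount_usd": 10.00,
        "payment": "visa",
        "ts": "2021-06-01T12:00:00Z",
    }

    # Keying by order id keeps all events for one order in one partition,
    # preserving their order: the "immutable sequence of events".
    producer.produce("orders", key=str(event["order_id"]), value=json.dumps(event))
    producer.flush()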
The reason someone would want to use the Kafka API, which, by the way, Pulsar has also started supporting (it's not just Redpanda; it's really a trend in data streaming systems), is that you can take Spark ML and TensorFlow and [inaudible 00:14:04] and all of these databases, and it just plugs right in, and you didn't write a single line of code.

Alex: You start to think about these systems as Lego pieces. Of course, for your business logic you have to write the code; it's what people get paid for. But for all of these other databases and downstream systems, whether you're sending data to Datadog for alerting, or Google BigQuery, or Amazon Redshift, or [inaudible 00:14:32], or any of these databases, there's already this huge ecosystem of code that works, that you don't have to maintain, because there are people already maintaining it, and so you're just plugging into this ecosystem. So I would say the largest motivation for someone to move away from a non-Kafka-API system, whether that's Google Cloud Pub/Sub or Azure Event Hubs or any of the hundred others, is the ecosystem. [15:00] And the community realizes that very quickly. I think Redpanda, for example, makes the Kafka community bigger and better. We're expanding the uses of the Kafka API into embedded use cases. For example, a security appliance company embeds Redpanda, because we're C++ with a super small footprint, inside their process to do intrusion detection. Every time they see an intrusion into the network, they write a bunch of data to disk, but through the Kafka API, and in the cloud it's the same code; it's just that one local Redpanda instance is the collector. So for people considering other systems, whether it's SQS, or [Inaudible 00:15:46], or Pub/Sub, or Azure Event Hubs: first of all, there are specific trade-offs we'd need to dig into in detail, but at the architectural level, plugging into the ecosystem of the Kafka API is so important. You get to leverage the last 10 years of work that you didn't have to do, and it takes you five seconds to connect to these other systems.

Utsav: That is really fascinating. I guess large companies have a million things they want to integrate with: open source things, all of these new databases like Materialize. So Kafka is kind of like the REST API, in a sense.

Alex: I think it's become the new network, to some extent. People joke about this. Think about it: if you had an appliance that could keep up with the throughput and latency of your network, but gave you auditability, access control, and replayability, why not? Some of our more cutting-edge users are using Redpanda as the new network, and they needed the performance that Redpanda brought to the Kafka API ecosystem to enable that kind of use case, where every message gets sent through Redpanda. It can keep up, it can saturate hardware, and now they get tracing and auditability; they can go back in time. So you're right: it's almost like the new REST API for microservices.
Utsav: Yeah. What is it about Kafka that makes it slow? From an outsider's perspective, it seems like when a code base gets more and more features, contributed by hundreds of people over a long time span, there are so many ifs and elses and checks that they bloat the API surface and slow things down, and then somebody profiles and improves things incrementally. Could you walk me through what you've learned by looking at the code base, and why you think it's slow? One thing you mentioned was kernel bypass, which skips all of the overhead there, but is there anything inherent about Kafka itself that makes it slow?

Alex: Yeah. So it's slow comparatively speaking, and we spent 400 hours benchmarking before we published, so I have a lot of detail on this particular claim. Let me step back. An expert could probably tune Kafka to get much better performance than most people; most people don't have 400 hours to benchmark different Kafka settings. Kafka is multimodal in performance, and I can dig into that a little. But assume you're an expert, and assume you're going to spend the time to tune Kafka for your particular workload, which, by the way, changes depending on throughput. The performance characteristics of running sustained workloads on Kafka actually vary, and therefore your threading model varies: the number of threads for your network, the number of threads for your disk, the number of threads for your background workloads, and the amount of memory. This, I think, is the [inaudible 00:18:51] of tuning Kafka that is the most daunting task for an engineer. Because it is impossible, in my opinion, to ask an engineer who doesn't know the internals of Kafka, unless they go and read the code, to understand: what is the relationship between my IO threads and my disk threads and my background workloads, and how much memory should I reserve for this versus that? All of these trade-offs start to matter as soon as you hit some form of saturation. So let me give the details on the parts where we improved performance, which is specifically tail latency, and why that matters for a messaging system, and throughput. By and large, Kafka can't drive hardware to throughput similar to Redpanda's. Redpanda always does at least as much as Kafka, and in some cases, which we highlight in the blog [Inaudible 00:19:51], we're a little better, say 30% or 40%. The real improvement is in the tail of the latency distribution. [20:00] Why does that matter? I'm going to focus on what Redpanda brings to the market rather than the negatives of Kafka, because we're built on the shoulders of Kafka; if Kafka didn't exist, we wouldn't have gotten to learn these improvements or understand the nuances. On tail latency: latency, and I've said this a few times, is the sum of all your bad decisions. That's what happens at the user level. When you send a request to the microservice you wrote, it's, oh, should I have used a different data structure? There's no cache locality, and so on. So what we focused on is: how do we give people predictable tail latency? And it turns out that, for our users, predictable tail latency often results in something like a 5x hardware reduction.
So let me materialize all of this performance improvement: where we're better, and how that shows up for users. We paid a lot of attention to detail. We spent a ton of engineering time, effort, and money on benchmarking and the test suite, making sure that once you get to a particular latency, it doesn't spike around. It's stable, because you need that kind of predictability. Let me give you a mental model of how you could achieve really good average latency and terrible tail latency: say you have a terabyte of heap, you just write to memory, and every 10 minutes you flush the terabyte. Every 10 minutes you get one request that is five minutes long, because you have to flush the terabyte to disk, and otherwise the system looks good. What people need to understand is that the more messages you put into Kafka, the more you hit those tail-latency spikes, and since it's a messaging system, a meaningful share of your users will experience the tail latency. So we asked: how can we improve this for users? In March we said, let's rethink this from scratch, and that really had a fundamental impact. Now, we don't use a lot of the Linux kernel facilities. There are global locks in the Linux kernel that are taken when you touch a global object, for example the page cache. And I actually think it's the right decision for the page cache to be global: if you look at the code, there are a ton of edge cases and things the kernel has to handle to make sure it even just works, and a lot more to make it fast. So this was a lot of engineering effort that we didn't know was going to pay off, to be honest, and it happened to pay off. We believed we could do better with modern hardware. So we don't take those global locks at the low level on Linux kernel objects, and because we don't use global resources, we've partitioned the memory across every individual core. Memory allocations are local; you don't have one global, massive garbage collection that has to traverse terabyte heaps; you have localized little memory arenas. It's like taking a 96-core computer and creating a mental model of 96 little computers inside that 96-core computer, and then structuring the Kafka API on top of that. Because, again, remember that the new bottleneck in computing is the CPU, so it's rethinking the architecture to really extract the value out of the hardware. My philosophy is that the hardware is so capable, the software should be able to drive it to saturation at all times: if you're not driving hardware to saturation on throughput, then you should be driving it at the lowest latency you can. And these things need to be predictable, because when you build an application, you don't want to think, most of the time I need five computers, but 10% of the time I need 150 computers, so let's provision an average of 70 or 75. It's really hard to build applications when your underlying infrastructure is not predictable, so that's really a big improvement. And the last improvement on the Kafka API side is that we only expose safe settings. We use Raft as the replication model, and I think that was a big improvement on the state of the art of streaming.
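For reference, the commit rule at the heart of Raft, and the reason "two of three" or "three of five" replicas is the vocabulary here, is just a majority quorum. A tiny sketch:

    def is_committed(acks: int, replicas: int) -> bool:
        # Raft considers an entry committed once a strict majority of
        # replicas have it durably on disk.
        return acks >= replicas // 2 + 1

    assert is_committed(2, 3)        # 2 of 3 replicas acked: committed
    assert not is_committed(2, 5)    # 2 of 5 is not a majority
    assert is_committed(3, 5)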
If you look at the actual implementations: Kafka has the ISR replication model, and Pulsar is, I think, primary-backup with some optimizations, versus our Raft implementation. We didn't invent our own protocol. [25:00] There's a mathematical proof behind the replication model, and you also understand it as a programmer: oh, this is what it means to have two of three replicas, or three of five replicas, holding the data. So it's all of this context. That was a long-winded answer, but you asked such a critical thing that I had to be very specific, to make sure I don't leave room for ambiguity.

Utsav: Yeah. Can you explain why it is important to partition memory per core? What happens when you don't? One thing you mentioned was the garbage collection that has to traverse everything. Can you elaborate on that?

Alex: Yeah. So there's nothing wrong, and everything works; it's the trade-offs that we want to optimize for. Basically, we want to make it cost-efficient for people to actually use data streaming. To me, streaming is in that weird space [Inaudible 00:25:57] a few years ago, where there's all this money being put into it, but very few people actually get value out of it. Why is this thing so expensive to run, and how do we bring it to the masses so that it's not so massively expensive? Basically, anyone who has run another streaming system that is in the [Inaudible 00:26:17] always has to over-provision, because they just don't understand its performance characteristics. So let me talk about memory partitioning. For modern computers, the trend is increasing core counts; the clock frequency of the CPU is not going to improve. Here's the tricky part, where it gets very detailed: individual CPUs still got faster even though clock frequency didn't improve, through very low-level things like instruction prefetching and pipelined execution. There are all of these tricks at the lowest level of instruction execution, and even with flat clock frequencies, they made for something like a 2x or maybe 3x performance improvement over the last 10 years. But the larger trend in computing is more cores: my desktop has 64 physical cores; it's the Ryzen 3900 series. In the data center there's also this trend, which I actually don't think the industry has settled on, where even on a single motherboard you have two sockets. With two sockets you have NUMA memory access and NUMA domains, which means every socket has local memory that it can allocate and access with low latency, as if it were one computer, but it can also use remote memory from the other socket. And when you rent a cloud computer, you want to understand what kind of hardware it is; to some extent you're paying for that virtualization, and most people run in the cloud these days. So why does pinning memory to a particular thread matter? It matters because, as I mentioned, latency is the sum of all your bad decisions.
And so what we did is say: let's take all of the memory for this particular machine, and I'll give you an opinionated view on it. If you're running at really large scale, I'm going to say the optimal production setting is two gigabytes per core. That's what we recommend for Redpanda. You can run it on 130 megabytes if you want, for very low-volume use cases, but if you're really aiming to push the hardware, those are the memory recommendations. So why is that important? When Redpanda starts up, it starts one pthread for every core. That gives me, the programmer, a concurrency and parallelism model: within each core, when I'm writing code in C++, I write it as a concurrent construction, but the parallelism is a free variable that gets executed on the physical hardware. Where memory comes in is that we split the memory evenly across the cores. Say you have a computer with 10 cores: we take all the memory, subtract about 10%, and split the rest by 10. Then we do something even more interesting: we go and ask the hardware, hey, for this core, what is the memory bank that belongs to this NUMA domain, to this CPU socket? And the hardware will tell you, based on the motherboard configuration, which memory belongs to that particular core. Then we tell the Linux kernel: allocate this memory, pin it to this particular thread, and lock it; don't give it to anybody else. And then the thread allocates it as a single byte array. What you've done is eliminate all forms of implicit cross-core communication, because that thread will only allocate memory on that particular core, unless the programmer explicitly tells the computer to allocate memory on a remote core. It's a relatively onerous system to get your hands on, unless you're programming in an actor model. So what does that mean for a user? Let me give you a real impact: with a big Fortune 1000 company, we took a 35-node Kafka cluster down to seven nodes. All of these little improvements matter, because at the end of the day they add up: a 5.5x hardware cost reduction, a 1600% performance improvement, all of these things. There's a blog post we wrote a month ago about some of the mechanical-sympathy techniques we use to ensure we deliver low latency through the Kafka API. So that was a long-winded way of explaining, at the lowest level, why it matters how you allocate memory. It all boils down to the things we were optimizing for: saturate the hardware, so streaming is affordable for a lot of people, and make it low latency, so you enable new use cases, like that oil and gas pipeline, for example. And yeah, that's one of the really deep [Inaudible 00:31:40]. I'm happy to compare our memory-pool algorithms and how they're different, but that's how we think about building software.
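The thread-per-core, shard-per-core model described above can be sketched in miniature: pin one worker per core, give it core-local state, and make explicit message passing the only cross-core communication. The sketch below is Python purely to show the shape; Python's GIL means it won't actually run shards in parallel, sched_setaffinity is Linux-only, and the real thing is C++ (Redpanda builds on the Seastar framework).

    import os
    import queue
    import threading

    NUM_CORES = os.cpu_count() or 1

    # One inbox per core; a shard's state is owned by exactly one thread,
    # so the hot path needs no cross-core locks.
    inboxes = [queue.Queue() for _ in range(NUM_CORES)]

    def shard_for(topic: str, partition: int) -> int:
        # Deterministically home each topic/partition on one core.
        return hash((topic, partition)) % NUM_CORES

    def worker(core_id: int) -> None:
        os.sched_setaffinity(0, {core_id})  # pin this thread to its core (Linux)
        local_state = {}  # core-local memory, never touched by other threads
        while True:
            topic, partition, record = inboxes[core_id].get()
            local_state.setdefault((topic, partition), []).append(record)

    for core in range(NUM_CORES):
        threading.Thread(target=worker, args=(core,), daemon=True).start()

    # Explicit message passing is the only cross-core communication.
    inboxes[shard_for("orders", 3)].put(("orders", 3, b"payload"))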
Utsav: Now I wanted to know: what is the latency difference when you use memory from your own NUMA node versus accessing remote memory? How much faster is it to stay on your own core?

Alex: It's faster, relative to the other. I think the right question to ask is: what is the latency of crossing that NUMA boundary in relation to the other things you have to do in the computer? If you have one thing to do, say you just need to allocate more memory on that core, it'll be plenty fast. But if you're trying to saturate hardware, as you would with Kafka, then let me give you an order-of-magnitude comparison. It's a few microseconds to cross the boundary and allocate memory from another core, from some experiments I did last year crossing the NUMA boundary. But let me put that in perspective with writing a single page to disk on NVMe with [Inaudible 00:32:55] bypass: you can write a page to an NVMe device, assuming regular NVMe like on your laptop, not 3D XPoint technology, in single-digit to double-digit microseconds. So when a memory allocation is in the microsecond range, that's really expensive in comparison with actually doing useful work. I think it's really hard for humans to have an intuition for latency unless we work in the low-latency space; it's hard to judge, but it adds up. And if there's contention, it can reach human-perceptible time, depending on how contended the resources are. Say you're trying to allocate from a memory bank on a remote NUMA node: if there's no free memory, you have to wait for a page fault, and things get written to disk. These things just add up. The better intuition is to compare against the other useful things you could be doing. The useful thing a computer does is something for the business, like detecting fraud or planning an Uber ride to your house. Crossing NUMA boundaries is really expensive in comparison with the things you could actually be using the computer for.
So raft in order to participate on this, you write the page, and then you flash the data on disk. But we could do it with this adaptive batching so that we write a bunch and then we issue a single flash with like five flashes. That's one thing. The second thing is what this latency gives you is a new computer model. We added WebAssembly. We actually took the V8 engine and that's two modes of execution for that V8 engine currently. One is as a sidecar process and one is in line in the process. So every core gets a V8 isolate, which is like the no JS engine. Inside of V8 Isolated there is a thing called V8 context. And just go into the terminology for a second because there's a lot of terms. It means that you connect to the JavaScript file. Inside is a V8 Concept. In fact, a context can execute multiple JavaScript files. So why does this matter? Given that Redpanda becomes programmable storage for the user. Think of like the Transformers, when Optimus Prime unite and then all the robots make a bigger robot kind of thing. It's like the fact that you can ship code to the storage engine and change the characteristics of the storage engine. So now you are streaming data, but because we're so far that were like, oh, now we have a latency budget, so we can do new things with the latency budget. We can introduce a computer model that allows the storage engine to do inline transformations of this data. So let's say you sent the JSON object and you want to remove the social security number for DDPR compliance or HIPAA compliance or whatever it is. Then you could just hit the JavaScript function and it will do an inline transformation of that, or it will just obscure for performance. So you don't reallocate just write xxx and then pass it along, but now you get to program what gets at the storage level. You're not ping-ponging your data between Redpanda and other systems. You're executing to some extent. You're really just sort of raising the level of abstraction of the storage system and the streaming system to do things that you couldn't do before, like inline execution or filtering inline execution of max gain, simple enrichments. Just simple things that are actually really challenging to do outside of this model. And so you insert the computation model where now you can ship code to the data and it executes in line. And so some of the more interesting things is actually exposing that the WebAssembly engine, which is just V8 to our end users. So as an end user, we've now the Kafka API where you say, RPK this command line, things that we run, wasn't deploying to give it a JavaScript file. You'd tell it the input source and the output source and that's it. The engine is in charge of executing that little JavaScript function for every message that goes in. So I think this is like the kind of impact that being fast gives you. You're now have computational efficiencies that allow you to do different things that you couldn't do before. Utsav : That's Interesting. I think one thing you mentioned where there was like HIPAA compliance or something to get rid of information, like what are some use cases that you can talk about publicly that you've seen that you just would not expect? And you were like, wow, like I can't believe, that is what this thing is being used for. Alex : Yeah. let me think. Well, you know, one of them is IP credit score. So why that's interesting is not as a single stuff, but as an aggregate of steps, it's really critical. 
Utsav: That's interesting. What are some use cases you can talk about publicly that you just would not expect, where you were like, wow, I can't believe that's what this thing is being used for?

Alex: Yeah, let me think. Well, one of them is IP credit scores. Why that's interesting is not as a single step, but as an aggregate of steps. [40:00] Let me try to frame it in a way that doesn't [Inaudible 00:39:50] the customers. We have a massive, internet-scale customer trying to give every one of their users profiling information that is anonymous, which is kind of wild. Every IP gets a credit-score-like number. In America you have credit scores from about 250 to 800, maybe 820, and they give a similar score to every IP and monitor it over time, and then they can push that credit score [Inaudible 00:40:27] inside that IP, and potentially make a trade on it. There are all of these weird market dynamics. Let me give you another example: say you watch the Twitter feed and ask, what is the metadata I can associate with this particular event, and can I make a trade on that? It's a really wild thing to do. And the last one we're focusing on is a user who is actually extending the WebAssembly protocol, because it's all on GitHub: you can download Redpanda and swap our WebAssembly engine for your own. He's spinning up very tall computers, tall in terms of CPU, memory, and GPU, with a [Inaudible 00:41:19] job running on the GPU, and every time a message comes in, it makes a local call to this sidecar process running a machine learning model to decide: should I proceed or not with this particular decision? Those are the things we just didn't plan for. When you expand the possibilities and give people a new computing platform, everyone will use it. And it's actually not a competing platform; it enriches how people think about their data infrastructure, because Spark and Apache Flink all continue to work, the Kafka API continues to work; we're simply extending that with this WebAssembly engine.

Utsav: I think it's fascinating. Say you want to build a competitor to Google today: the amount of machines they have for running search is very high. Not that you'd easily be able to build a competitor, but using something like this makes so much of your infrastructure cost cheaper that it becomes possible to do more things. That's the way I'm thinking about it.

Alex: Yeah. We're actually in talks with multiple database companies that we can't name, but what's interesting is that there are multiple. We're actually their historical data store, for both their real-time engines and their historical data. The Kafka API, and of course Redpanda, gives the programmer a way to address, by integer, every single message that has been written to the log. It gives you total addressability of the log, which is really powerful. Why? If you're a database, imagine that each of these addresses is like a page on a file system, an immutable page. These database vendors stream data through Redpanda; it gets written as Kafka batches, and we push that data to S3. We can transparently hydrate those pages and give them back. So they've built an index on top of their own indexes that allows them to treat a Kafka batch as the unit for fetching pages, like a page-fault mechanism, while also ingesting real-time traffic.
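That "address every message by integer" property is visible directly in the Kafka API: a consumer can seek to an arbitrary offset and re-read history, which is what makes the page-like hydration described above possible. A sketch with the confluent-kafka client, where broker, group, topic, and offsets are placeholders:

    from confluent_kafka import Consumer, TopicPartition

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "rehydrator",
        "enable.auto.commit": False,
    })

    # Jump straight to offset 1000 of partition 0, like faulting in a page
    # of an immutable file, and read five historical records.
    consumer.assign([TopicPartition("orders", 0, 1000)])
    for _ in range(5):
        msg = consumer.poll(timeout=5.0)
        if msg is None or msg.error():
            break
        print(msg.offset(), msg.value())
    consumer.close()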
That page-hydration pattern is another really fascinating thing that we didn't think of until we started building this.

Utsav: Yeah. So let me ask you: how do you get customers? You build this thing, it's amazing, and I'm assuming your customer is another principal engineer somewhere who's frustrated at how slow Kafka is, or at the operational costs. But my concern would be: how do we know that Redpanda is not going to lose data? Right now it's easier because you have many more customers, but how do you bootstrap something like this, and how do you get people to trust your software and say, yes, I will replace Kafka with this?

Alex: Yeah. That's a really hard answer, because building trust is really hard as a company. I think that was one of the hardest things we had to do in the beginning. Our earliest users were people who knew me as an individual engineer; they're friends of mine. I said, I'm building this; would you be willing to give it a shot? But that only lasts so long. So what we really had to do is test with Jepsen, which is basically a storage-system hammer [45:00] that verifies you're not b**********g your customers: if you say you have a Raft implementation, then you'd better have a Raft implementation according to the spec. Kyle is a fantastic engineer, too. And what we did is, we were fortunate enough to hire Denis, who has been focused on correctness for a really long time. He came in and built an extended, internal Jepsen-style test suite, and we just tested, for eight months. It seems like a lot of progress, but you have to understand that we stayed quiet for two and a half years. We just didn't tell anyone, and in the meantime we were testing the framework. The first Raft implementation, just to put it in perspective, took two and a half years to build into something that actually works and is scalable. This isn't overnight: it took two and a half years to build a Raft implementation, and now we get to build the future of streaming. So the way we verify is really by running our extended Jepsen suite. We're going to release, hopefully sometime later this year, an actual formal evaluation with external consultants. And people trust us because our code is on GitHub. So it's not just a vendor saying they have all this cool stuff while underneath it's just a proxy for Kafka on the JVM. No, you can go look at it. A bunch of people have gone through the 200,000 or 300,000 lines of C++ code; it's on GitHub, and you can see the things that we do. I invite all the listeners to go check it out and try it, because you can verify these claims for yourself. You can't hide when it's in the open. We use the [Inaudible 46:51] license, so everyone can download it. We have only one restriction: we're the only ones allowed to offer a hosted version of Redpanda. Other than that, you can run it, and in four years the code becomes Apache 2, so it's really only for the next four years. I think that's a good trade-off for us. But you earn trust in multiple ways. One is that people know you and are willing to take a bet on you, but that only lasts for your first customer or two. The other is that you build a [Inaudible 47:24] empirically. So that's an important thing.
You prove the stuff that you're claiming; it's on us to prove to the world that we're safe. The third one is that we didn't invent a replication protocol. ISR is a thing that Kafka invented; we just went with Raft. We said, we don't want to invent a new mechanism for replicating data; we want to write a mechanical execution of Raft that is super good. So it's relying on existing research, like Raft, and focusing on the things we're good at, which is engineering and making things really fast. And then, over time, there's social proof: you get a couple of customers and they refer you, and then you start to push petabytes of traffic; a hundred-something terabytes per day with one of our customers. At some point, if you have enough users, every part of the system is always under test, and I think we've stepped into that territory, where we have enough users that every part of the system is always being exercised. But still, we have to keep testing, make sure every command runs through safely, and adhere to our claims.

Utsav: Yeah. Maybe you can expand a little bit on the release process. How do you know you're shipping a new version that's safe for customers?

Alex: Yeah, that is a lot to cover. We have five different types of fault-injection frameworks. Five frameworks, not five kinds of test suites: five totally independent frameworks. One of them we call Punisher, and it's just random fault injection; that's the first level. Redpanda is always running on a test cluster, and every 130 seconds a fatal fault is introduced into the system. Literally, a program logs in, not manually but programmatically, and removes your data directory, and the system has to recover from that. Or it [Inaudible 49:32] into the system and kill -9s it, or it sends an incorrect sequence of commands to create a topic in Kafka. That's the first level, and it's always running. It tells us that the system doesn't [Inaudible 49:50] for any reason. The second framework is a FUSE file system: instead of writing to disk, Redpanda writes to a virtual disk interface, and then we get to inject failures deterministically. [50:08] The thing about fault injection is that things go south when you combine three or five edge criteria. Through unit testing you can get pretty good coverage of the surface area, but it's when you combine a slow disk and a faulty machine and a bad leader. So we inject surgically: for this topic partition, we inject a slow write and accelerate the reads, and then you have to verify that there's a correct transfer of leadership. The third is a programmatic fault-injection scheduler, where we can terminate, delay, and do all these things. The fourth is ducky, which is actually how Apache Kafka gets tested: there's a thing in Apache Kafka called ducktape, and it injects particular faults at the Kafka API level. It's not just enough that we test internally for the safety of the system; at the user interface, users have to actually get the things we say we provide. And because we're Kafka API-compatible, we leveraged the Kafka tests to inject particular failures.
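A fault-injection loop in the spirit of Punisher can be sketched in a few lines. The node names, paths, and remote commands below are hypothetical stand-ins for a real harness, and a real harness would verify recovery after each fault rather than just injecting it:

    import random
    import subprocess
    import time

    NODES = ["node-1", "node-2", "node-3"]   # hypothetical cluster
    DATA_DIR = "/var/lib/redpanda/data"      # hypothetical data directory

    FAULTS = [
        ["pkill", "-9", "redpanda"],         # hard-kill the process
        ["rm", "-rf", DATA_DIR],             # lose the data directory
        ["tc", "qdisc", "add", "dev", "eth0",
         "root", "netem", "delay", "500ms"], # inject 500ms of network delay
    ]

    while True:
        node, fault = random.choice(NODES), random.choice(FAULTS)
        subprocess.run(["ssh", node, *fault])
        time.sleep(130)  # one fault roughly every 130 seconds, as described
        # A real harness would now assert the cluster healed: leadership moved,
        # replicas caught up, and acknowledged writes are still readable.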
Concretely, we start off with three machines: we take librdkafka, write a gigabyte of data, and crash the machines. We bring them back up and read the gigabyte back. Then we start to complicate the experiments. So that's the fourth one, and I'm pretty sure I mentioned another, but I know we have five. And the release process is that every single commit to dev gets run through this for a really long time, and our CI is parallel: every merge to dev takes something like five hours of testing, but we parallelize it, so in human time it takes one hour; we just run five in parallel at a time. So that's really the release process. It takes a long time, and of course, if something breaks, we hold the release. In addition to that, there are manual tests too, for the things we're still codifying into the chaos suite.

Utsav: I wonder if Kafka could use the frameworks that you've built. Maybe that would be an interesting day, when Kafka starts using some of them.

Alex: Some of the things are too coupled to the company. To launch tests, we have an internal tool called vtools, as in "vectorized tools," the name of the company. We say, vtools, give me a cluster, and it picks the binary from CI, deploys it into a cluster, and starts a particular test. That's specific to our setup. Otherwise, a lot of these tests, I think three out of five, are in the open. The general ones, people can just go and look at, so that could help. But the other two, the more involved ones, are internal.

Utsav: Okay. What is something surprising you've learned about the state of Kafka? You had an opinion when you started Vectorized, that streaming is the future and it's going to be used everywhere, and I'm sure you've learned a lot through talking to customers and actually deploying Redpanda in the wild. So what is something surprising that you've learned?

Alex: I feel like I'm surprised almost every day. People are so interesting with the stuff that they're doing; it's very cool. Multiple surprises: there are business surprises, and there are customer surprises. From a business perspective: I'm a principal engineer by trade; I was a CTO before; this is the first time I'm a CEO. It was really a lot of self-discovery, to figure out whether I can do this and how on earth to do it. So that's one. And that was a lot of self-discovery because the company started from a place of really wanting to scratch a personal itch. I wrote this storage engine; if you look at the commits, I think to date I'm still the largest committer in the repo, and I'm the CEO, and that's obviously just history. It's because I wrote the storage engine with the page-cache bypass and the first allocator and the first compaction strategy, because I wanted this at a technical level. Then it turns out that when I started interviewing people, and we interviewed tens of companies, maybe fewer than a hundred but definitely more than fifty, people were having the same problems that I was having personally.
There's like this huge opportunity. I'll make it simple. And you know, what the interesting part about that is that the JavaScript and Python ecosystem, they love us because they're used to running on JS and engine X and this little processes, I mean like later in terms of footprint, like sophisticated code and they're like, oh, Redpanda does this really simple thing that I can run myself and it's easy. They feel we sort of empower some of these developers that would never come over to the JVM and Kafka ecosystem just because it's so complicated and so just to basically productionize it easy to get started, let me be clear, it’s hard to run into stable in production. And so that was a really surprising thing. And then from a customer level the thing we used for the oil and gas was really, I think, revealing. The embedding use case and the edge IOT use case was I was blown because I've been using Kafka as a centralized sort of methods hub where everything goes. I never thought of it as it being able to power IOT, like things. Or like this intrusion detection system, where they wanted this consistent API between their cluster and their local processes and, Redpanda was like the perfect fit for that. I think in the modern world, there's going to be a lot more users of that and I think that sort of people pushing the boundaries. I think there's been a lot of surprising things, but those are good highlights. Utsav : Have you been just surprised at the number of people who use Kafka more than anything else or were you like kind of expecting it? Alex : I've been blown. I think there's two parts to that. One streaming is an AdSense market. I think Concord just filed their SCC recently. I think saw that today. I think they're going to have a massively successful life beyond. I wish them luck and success, because the bigger they are, the bigger we are too. We add to the ecosystem. I actually see us expanding the Kafka API users to other things that couldn't be done before. I think it's big in terms of its total size, but it also think is big in that it's growing. And so the number of new users that are coming to us about like, oh, I'm thinking about this business idea, how would you do it in real time? And here's, what's interesting about that. You and I put pressure on products that scientifically translate to things like Redpanda. Say I want to order food. My wife is pregnant and she wants Korean food tomorrow, but she doesn't eat meat, but when she's pregnant, she wants to eat meat. And so I want to order that and it's like, I want to be notified every five minutes. What's going on with the food? Is it going to get here? Blah, blah, blah. And so end users end up putting pressure on companies. A software doesn't run on category theory, it runs on this real hardware and ultimately it ends up in something looking like red Panda. And so I think that's interesting. There's a ton of new users where they're getting asked by end users, like you and I, to effectively transform the enterprise into this real time business and start to extract value out of the now, what’s happening literally right now. I think to them, once they learn that they can do this, that Redpanda empowers them to do this, they never want to go back to batch. Streaming is also strict superset of batch; without getting theoretical here. Once you can start to extract value out of what's happening in your business in real time, nobody's ever want to go back. So I think it’s those two things. 
one, the market is large, and two, it's growing really fast. And so that was surprising to me.

Utsav: Cool. This is a final question; you don't have to answer it. It's just a standard check. If Confluent's CEO came to you tomorrow and said, we just want to buy Redpanda, what would you think of that?

Alex: I don't know. I think from a company perspective, I want to take the company to being a public company myself, and I think there's plenty of room. The pie is actually giant, and I just think there hasn't been a lot of competition in this space; that's my view, at least. [1:00:00] Yeah, and so I think we're making it better. Let me give you an example: the Apache web server dominated the HTTP standards for a long time. It almost didn't matter what the standard said; Apache was the thing, and whatever they implemented was what the standard implied. Then NGINX came about, and people were like, wait, hold on a second, there's this new thing. And it actually kicked off new interest in standardizing and developing the protocol further. I think it's similar here, and it's only natural for it to happen to the Kafka API. Pulsar is also trying to bring in the Kafka API; we say it's our first-class citizen. So I think there's room for multiple players who offer different things. We offer extremely low latency and safety, and we get to take advantage of the hardware, so I think people are really attracted to our offering from a technical perspective, especially for new use cases. And yeah, so I don't know; I think there's a chance for multiple players in this.

Utsav: Yeah, it's exciting. I think, like OpenTelemetry, there will be an open streaming API or something eventually; there will be a working group that gets folded into the CNCF, and all those things.

Alex: Exactly.

Utsav: Well, thanks again, Alex, for being a guest. I think this was a lot of fun, and it is fascinating to me. I did not realize how many people need streaming and are using streaming. I guess the IoT use case had just slipped my mind, but it makes so much sense. This was super informative, and thank you for being a guest.

Alex: Thanks for having me. It was a pleasure.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
 
John Egan is the CEO and Co-Founder of Kintaba, an incident management platform. He was the co-creator of Workplace by Facebook, and previously built Caffeinated Mind, a file transfer company that was acquired by Facebook.

In this episode, our focus is on incident management tools and culture. We discuss learnings about incident management from John's personal experiences at his startup and at Facebook, and his observations from customers of Kintaba. We explore the stage at which a company might become interested in an incident response tool, the surprising adoption of such tools outside of engineering teams, the benefits of enforcing cultural norms via tools, and whether such internal tools should lean towards being opinionated or flexible. We also discuss postmortem culture, and how the software industry moves forward by learning through transparency about failures.

Apple Podcasts | Spotify | Google Podcasts

Video Highlights

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
 
Nipunn Koorapati was a Software Engineer at Dropbox, where he worked on two distinct areas: Developer Productivity and Client Sync. He drove initiatives like consolidating various repositories into a server-side monorepo (read more here), and was part of a high-leverage project to rewrite the sync engine, a core part of Dropbox's business.

Apple Podcasts | Spotify | Google Podcasts

I worked with Nipunn in 2020, and we discovered interesting but unsurprising similarities between the software challenges facing Git and Dropbox. We explore some of the reasons why a tech organization might want to consolidate repositories, some of the improvements being developed in Git, like partial clones and sparse checkouts, the similarities between Git and Dropbox, how to think about and roll out a massive, business-critical rewrite, and more.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
 