talk-data.com talk-data.com

Topic

Analytics

data_analysis insights metrics

227

tagged

Activity Trend

398 peak/qtr
2020-Q1 2026-Q1

Activities

Showing filtered results

Filtering by: Data Engineering Podcast ×

Summary The PostgreSQL database is massively popular due to its flexibility and extensive ecosystem of extensions, but it is still not the first choice for high performance analytics. Swarm64 aims to change that by adding support for advanced hardware capabilities like FPGAs and optimized usage of modern SSDs. In this episode CEO and co-founder Thomas Richter discusses his motivation for creating an extension to optimize Postgres hardware usage, the benefits of running your analytics on the same platform as your application, and how it works under the hood. If you are trying to get more performance out of your database then this episode is for you!

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You monitor your website to make sure that you’re the first to know when something goes wrong, but what about your data? Tidy Data is the DataOps monitoring platform that you’ve been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, Pagerduty, and custom webhooks you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required. Your host is Tobias Macey and today I’m interviewing Thomas Richter about Swarm64, a PostgreSQL extension to improve parallelism and add support for FPGAs

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Swarm64 is?

How did the business get started and what keeps you motivated?

What are some of the common bottlenecks that users of postgres run into? What are the use cases and workloads that gain the most benefit from increased parallelism in the database engine? By increasing the processing throughput of the database, how does that impact disk I/O and what are some options for avoiding bottlenecks in the persistence layer? Can you describe how Swarm64 is implemented?

How has the product evolved since you first began working on it?

How has the evolution of postgres impacted your product direction?

What are some of the notable challenges that you have dealt with as a result of upstream changes in postgres?

How has the hardware landscape evolved and how does that affect your prioritization of features and improvements? What are some of the other extensions in the postgres ecosystem that are most commonly used alongside Swarm64?

Which extensions conflict with yours and how does that impact potential adoption?

In addition to your work to optimize performance of the postres engine, you also provide support for using an FPGA as a co-processor. What are the benefits that an FPGA provides over and above a CPU or GPU architecture?

What are the available options for provisioning hardware in a datacenter or the cloud that has access to an FPGA? Most people are familiar with the relevant attributes for selecting a CPU or GPU, what are the specifications that they should be looking at when selecting an FPGA?

For users who are adopting Swarm64, how does it impact the way they should be thinking of their data models? What is involved in migrating an existing database to use Swarm64? What are some of the most interesting, unexpected, or

Summary Data is a critical element to every role in an organization, which is also what makes managing it so challenging. With so many different opinions about which pieces of information are most important, how it needs to be accessed, and what to do with it, many data projects are doomed to failure. In this episode Chris Bergh explains how taking an agile approach to delivering value can drive down the complexity that grows out of the varied needs of the business. Building a DataOps workflow that incorporates fast delivery of well defined projects, continuous testing, and open lines of communication is a proven path to success.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! If DataOps sounds like the perfect antidote to your pipeline woes, DataKitchen is here to help. DataKitchen’s DataOps Platform automates and coordinates all the people, tools, and environments in your entire data analytics organization – everything from orchestration, testing and monitoring to development and deployment. In no time, you’ll reclaim control of your data pipelines so you can start delivering business value instantly, without errors. Go to dataengineeringpodcast.com/datakitchen today to learn more and thank them for supporting the show! Your host is Tobias Macey and today I’m welcoming back Chris Bergh to talk about ways that DataOps principles can help to reduce organizational complexity

Interview

Introduction How did you get involved in the area of data management? How are typical data and analytic teams organized? What are their roles and structure? Can you start by giving an outline of the ways that complexity can manifest in a data organization?

What are some of the contributing factors that generate this complexity? How does the size or scale of an organization and their data needs impact the segmentation of responsibilities and roles?

How does this organizational complexity play out within a single team? For example between data engineers, data scientists, and production/operations? How do you approach the definition of useful interfaces between different roles or groups within an organization?

What are your thoughts on the relationship between the multivariate complexities of data and analytics workflows and the software trend toward microservices as a means of addressing the challenges of organizational communication patterns in the software lifecycle?

How does this organizational complexity play out between multiple teams? For example between centralized data team and line of business self service teams? Isn’t organizational complexity just ‘the way it is’? Is there any how in getting out of meetings and inter team conflict? What are some of the technical elements that are most impactful in reducing the time to delivery for different roles? What are some strategies that you have found to be useful for maintaining a connection to the business need throughout the different stages of the data lifecycle? What are some of the signs or symptoms of problematic complexity that individuals and organizations should keep an eye out for? What role can automated testing play in improving this process? How do the current set of tools contribute to the fragmentation of data wor

Summary Knowledge graphs are a data resource that can answer questions beyond the scope of traditional data analytics. By organizing and storing data to emphasize the relationship between entities, we can discover the complex connections between multiple sources of information. In this episode John Maiden talks about how Cherre builds knowledge graphs that provide powerful insights for their customers and the engineering challenges of building a scalable graph. If you’re wondering how to extract additional business value from existing data, this episode will provide a way to expand your data resources.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. We have partnered with organizations such as ODSC, and Data Council. Upcoming events include ODSC East which has gone virtual starting April 16th. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing John Maiden about how Cherre is building and using a knowledge graph of commercial real estate information

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Cherre is and the role that data plays in the business? What are the benefits of a knowledge graph for making real estate investment decisions? What are the main ways that you and your customers are using the knowledge graph?

What are some of the challenges that you face in providing a usable interface for end-users to query the graph?

What technology are you using for storing and processing the graph?

What challenges do you face in scaling the complexity and analysis of the graph?

What are the main sources of data for the knowledge graph? What are some of the ways that messiness manifests in the data that you are using to populate the graph?

How are you managing cleaning of the data and how do you identify and process records that can’t be coerced into the desired structure? How do you handle missing attributes or extra attributes in a given record?

How did you approach the process of determining an effective taxonomy for records in the graph? What is involved in performing entity extraction on your data? What are some of the most interesting or unexpected questions that you have been able to ask and answer with the graph? What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with this data? What are some of the near and medium term improvements that you have planned for your knowledge graph? What advice do you have for anyone who is interested in building a knowledge graph of their own?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for liste

Summary CouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and a HTTP interface it has become popular as a backend for web and mobile applications. Created 15 years ago, it has accrued some technical debt which is being addressed with a refactored architecture based on FoundationDB. In this episode Adam Kocoloski shares the history of the project, how it works under the hood, and how the new design will improve the project for our new era of computation. This was an interesting conversation about the challenges of maintaining a large and mission critical project and the work being done to evolve it.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer! Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Adam Kocoloski about CouchDB and the work being done to migrate the storage layer to FoundationDB

Interview

Introduction How did you get involved in the area of data management? Can you starty by describing what CouchDB is?

How did you get involved in the CouchDB project and what is your current role in the community?

What are the use cases that it is well suited for? Can you share some of the history of CouchDB and its role in the NoSQL movement? How is CouchDB currently architected and how has it evolved since it was first introduced? What have been the benefits and challenges of Erlang as the runtime for CouchDB? How is the current storage engine implemented and what are its shortcomings? What problems are you trying to solve by replatforming on a new storage layer?

What were the selection criteria for the new storage engine and how did you structure the decision making process? What was the motivation for choosing FoundationDB as opposed to other options such as rocksDB, levelDB, etc.?

How is the adoption of FoundationDB going to impact the overall architecture and implementation of CouchDB? How will the use of FoundationDB impact the way that the current capabilities are implemented, such as data replication? What will the migration path be for people running an existing installation? What are some of the biggest challenges that you are facing in rearchitecting the codebase? What new capabilities will the FoundationDB storage layer enable? What are some of the most interesting/unexpected/innovative ways that you have seen CouchDB used?

What new capabilities or use cases do you anticipate once this migration is complete?

What are some of the most interesting/unexpected/challenging lessons that you have learned while working with the CouchDB project and community? What is in store for the future of CouchDB?

Contact Info

LinkedIn @kocolosk on Twitter kocolosk on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Apache CouchDB FoundationDB

Podcast Episode

IBM Cloudant Experimental Particle Physics FPGA == Field Programmable Gate Array Apache Software Foundation CRDT == Conflict-free Replicated Data Type

Podcast Episode

Erlang Riak RabbitMQ Heisenbug Kubernetes Property Based Testing

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Misaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The availability of cloud platforms and managed services makes this a viable option, but can lead to downstream challenges. In this episode Sean Knapp and Charlie Crocker share their experiences of working in and with companies that have dealt with shadow IT projects and the importance of enabling and empowering the use and exploration of data and analytics. If you have ever been frustrated by seemingly draconian policies or struggled to align everyone on your supported platform, then this episode will help you gain some perspective and set you on a path to productive collaboration.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Sean Knapp, Charlie Crocker about shadow IT in data and analytics

Interview

Introduction How did you get involved in the area of data management? Can you start by sharing your definition of shadow IT? What are some of the reasons that members of an organization might start building their own solutions outside of what is supported by the engineering teams?

What are some of the roles in an organization that you have seen involved in these shadow IT projects?

What kinds of tools or platforms are well suited for being provisioned and managed without involvement from the platform team?

What are some of the pitfalls that these solutions present as a result of their initial ease of use?

What are the benefits to the organization of individuals or teams building and managing their own solutions? What are some of the risks associated with these implementations of data collection, storage, man

Summary One of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage multiple instances of their platform across customer environments. In this episode Josh Beemster, the technical operations lead at Snowplow, explains how they manage automation, deployment, monitoring, scaling, and maintenance of their streaming analytics pipeline for event data. He also shares the challenges they face in supporting multiple cloud environments and the need to integrate with existing customer systems. If you are daunted by the needs of your data infrastructure then it’s worth listening to how Josh and his team are approaching the problem.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Josh Beemster about how Snowplow manages deployment and maintenance of their managed service in their customer’s cloud accounts.

Interview

Introduction How did you get involved in the area of data management? Can you start by giving an overview of the components in your system architecture and the nature of your managed service? What are some of the challenges that are inherent to private SaaS nature of your managed service? What elements of your system require the most attention and maintenance to keep them running properly? Which components in the pipeline are most subject to variability in traffic or resource pressure and what do you do to ensure proper capacity? How do you manage deployment of the full Snowplow pipeline for your customers?

How has your strategy for deployment evolved since you first began Soffering the managed service? How has the architecture of the pipeline evolved to simplify operations?

How much customization do you allow for in the event that the customer has their own system that they want to use in place of one of your supported components?

What are some of the common difficulties that you encounter when working with customers who need customized components, topologies, or event flows?

How does that reflect in the tooling that you use to manage their deployments?

What types of metrics do you track and what do you use for monitoring and alerting to ensure that your customers pipelines are running smoothly? What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with and on Snowplow? What are some lessons that you can generalize for management of data infrastructure more broadly? If you could start over with all of Snowplow and the infrastructure automation for it today, what would you do differently? What do you have planned for the future of the Snowplow product and infrastructure management?

Contact Info

LinkedIn jbeemster on GitHub @jbeemster1 on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Snowplow Analytics

Podcast Episode

Terraform Consul Nomad Meltdown Vulnerability Spectre Vulnerability AWS Kinesis Elasticsearch SnowflakeDB Indicative S3 Segment AWS Cloudwatch Stackdriver Apache Kafka Apache Pulsar Google Cloud PubSub AWS SQS AWS SNS AWS Redshift Ansible AWS Cloudformation Kubernetes AWS EMR

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Designing the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data modeling strategy that provides them with flexibility and speed. Data Vault is an approach that allows for evolving a data model in place without requiring destructive transformations and massive up front design to answer valuable questions. In this episode Kent Graziano shares his journey with data vault, explains how it allows for an agile approach to data warehousing, and explains the core principles of how to use it. If you’re struggling with unwieldy dimensional models, slow moving projects, or challenges integrating new data sources then listen in on this conversation and then give data vault a try for yourself.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve Clickhouse, the open source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for Clickhouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Kent Graziano about data vault modeling and the role that it plays in the current data landscape

Interview

Introduction How did you get involved in the area of data management? Can you start by giving an overview of what data vault modeling is and how it differs from other approaches such as third normal form or the star/snowflake schema?

What is the history of this approach and what limitations of alternate styles of modeling is it attempting to overcome? How did you first encounter this approach to data modeling and what is your motivation for dedicating so much time and energy to promoting it?

What are some of the primary challenges associated with data modeling that contribute to the long lead times for data requests or o

Summary Building a reliable data platform is a neverending task. Even if you have a process that works for you and your business there can be unexpected events that require a change in your platform architecture. In this episode the head of data for Mayvenn shares their experience migrating an existing set of streaming workflows onto the Ascend platform after their previous vendor was acquired and changed their offering. This is an interesting discussion about the ongoing maintenance and decision making required to keep your business data up to date and accurate.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Sheel Choksi and Sean Knapp about Mayvenn’s experience migrating their dataflows onto the Ascend platform

Interview

Introduction How did you get involved in the area of data management? Can you start off by describing what Mayvenn is and give a sense of how you are using data? What are the sources of data that you are working with? What are the biggest challenges you are facing in collecting, processing, and analyzing your data? Before adopting Ascend, what did your overall platform for data management look like? What were the pain points that you were facing which led you to seek a new solution?

What were the selection criteria that you set forth for addressing your needs at the time? What were the aspects of Ascend which were most appealing?

What are some of the edge cases that you have dealt with in the Ascend platform? Now that you have been using Ascend for a while, what components of your previous architecture have you been able to retire? Can you talk through the migration process of incorporating Ascend into your platform and any validation that you used to ensure that your data operations remained accurate and consistent? How has the migration to Ascend impacted your overall capacity for processing data or integrating new sources into your analytics? What are your future plans for how to use data across your organization?

Contact Info

Sheel

LinkedIn sheelc on GitHub

Sean

LinkedIn @seanknapp on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is b

Summary Transactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data. To address that need more completely the team at Materialize has created an engine that allows for building queryable views of your data as it is continually updated from the stream of changes being generated by your applications. In this episode Frank McSherry, chief scientist of Materialize, explains why it was created, what use cases it enables, and how it works to provide fast queries on continually updated data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Frank McSherry about Materialize, an engine for maintaining materialized views on incrementally updated data from change data captures

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Materialize is and the problems that you are aiming to solve with it?

What was your motivation for creating it?

What use cases does Materialize enable?

What are some of the existing tools or systems that you have seen employed to address those needs which can be replaced by Materialize? How does it fit into the broader ecosystem of data tools and platforms?

What are some of the use cases that Materialize is uniquely able to support? How is Materialize architected and how has the design evolved since you first began working on it? Materialize is based on your timely-dataflow project, which itself is based on the work you did on Naiad. What was your reasoning for using Rust as the implementation target and what benefits has it provided?

What are some of the components or primitives that were missing in the Rust ecosystem as compared to what is available in Java or C/C++, which have been the dominant languages for distributed data systems?

In the list of features, you highlight full support for ANSI SQL 92. What were some of the edge cases that you faced in complying with that standard given the distributed execution context for Materialize?

A majority of SQL oriented platforms define custom extensions or built-in functions that are specific to their problem domain. What are some of the existing or

Summary Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web based transformation tool with built in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email [email protected] with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Marquez is?

What was missing in existing metadata management platforms that necessitated the creation of Marquez?

How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs?

How does it compare to the Amundsen platform that Lyft recently released?

What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see? What are some of the capabilities that are unique to Marquez and how are you using them at WeWork? What are the primary resource types that you support in Marquez?

What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?

Can you explain how Marquez is architected and how the design has evolved since you first began working on it?

Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez?

What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?

How is the metadata itself stored and managed in Marquez?

How much up-front data modeling is necessary and what types of schema representations are supported?

Can you talk through the overall workflow of someone using Marquez in their environment?

What is involved in registering and updating datasets? How do you define and track the health of a given dataset? What are some of the interesting questions that can be answered from the information stored in Marquez?

What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases? For someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it? What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform? When is Marquez the wrong choice for a metadata repository? What do you have planned for the future of Marquez?

Contact Info

Julien Le Dem

@J_ on Twitter Email julienledem on GitHub

Willy

LinkedIn @wslulciuc on Twitter wslulciuc on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Marquez

DataEngConf Presentation

WeWork Canary Yahoo Dremio Hadoop Pig Parquet

Podcast Episode

Airflow Apache Atlas Amundsen

Podcast Episode

Uber DataBook LinkedIn DataHub Iceberg Table Format

Podcast Episode

Delta Lake

Podcast Episode

Great Expectations data pipeline unit testing framework

Podcast.init Episode

Redshift SnowflakeDB

Podcast Episode

Apache Kafka Schema Registry

Podcast Episode

Open Tracing Jaeger Zipkin DropWizard Java framework Marquez UI Cayley Graph Database Kubernetes Marquez Helm Chart Marquez Docker Container Dagster

Podcast Episode

Luigi DBT

Podcast Episode

Thrift Protocol Buffers

The intro and outro music is from a href="http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug?utm_source=rss&utm_medium=rss"…

Summary The scale and complexity of the systems that we build to satisfy business requirements is increasing as the available tools become more sophisticated. In order to bridge the gap between legacy infrastructure and evolving use cases it is necessary to create a unifying set of components. In this episode Dipti Borkar explains how the emerging category of data orchestration tools fills this need, some of the existing projects that fit in this space, and some of the ways that they can work together to simplify projects such as cloud migration and hybrid cloud environments. It is always useful to get a broad view of new trends in the industry and this was a helpful perspective on the need to provide mechanisms to decouple physical storage from computing capacity.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Dipti Borkark about data orchestration and how it helps in migrating data workloads to the cloud

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what you mean by the term "Data Orchestration"?

How does it compare to the concept of "Data Virtualization"? What are some of the tools and platforms that fit under that umbrella?

What are some of the motivations for organizations to use the cloud for their data oriented workloads?

What are they giving up by using cloud resources in place of on-premises compute?

For businesses that have invested heavily in their own datacenters, what are some ways that they can begin to replicate some of the benefits of cloud environments? What are some of the common patterns for cloud migration projects and what challenges do they present?

Do you have advice on useful metrics to track for determining project completion or success criteria?

How do businesses approach employee education for designing and implementing effective systems for achieving their migration goals? Can you talk through some of the ways that different data orchestration tools can be composed together for a cloud migration effort?

What are some of the common pain points that organizations encounter when working on hybrid implementations?

What are some of the missing pieces in the data orchestration landscape?

Are there any efforts that you are aware of that are aiming to fill those gaps?

Where is the data orchestration market heading, and what are some industry trends that are driving it?

What projects are you most interested in or excited by?

For someone who wants to learn more about data orchestration and the benefits the technologies can provide, what are some resources that you would recommend?

Contact Info

LinkedIn @dborkar on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Alluxio

Podcast Episode

UC San Diego Couchbase Presto

Podcast Episode

Spark SQL Data Orchestration Data Virtualization PyTorch

Podcast.init Episode

Rook storage orchestration PySpark MinIO

Podcast Episode

Kubernetes Openstack Hadoop HDFS Parquet Files

Podcast Episode

ORC Files Hive Metastore Iceberg Table Format

Podcast Episode

Data Orchestration Summit Star Schema Snowflake Schema Data Warehouse Data Lake Teradata

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Managing a data warehouse can be challenging, especially when trying to maintain a common set of patterns. Dataform is a platform that helps you apply engineering principles to your data transformations and table definitions, including unit testing SQL scripts, defining repeatable pipelines, and adding metadata to your warehouse to improve your team’s communication. In this episode CTO and co-founder of Dataform Lewis Hemens joins the show to explain his motivation for creating the platform and company, how it works under the covers, and how you can start using it today to get your data warehouse under control.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more. Are you working on data, analytics, or AI using platforms such as Presto, Spark, or Tensorflow? Check out the Data Orchestration Summit on November 7 at the Computer History Museum in Mountain View. This one day conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, Tensorflow, and you will also hear from creators of open source projects including Alluxio, Presto, Airflow, Iceberg, and more! Use discount code PODCAST for 25% off of your ticket, and the first five people to register get free tickets! Register now as early bird tickets are ending this week! Attendees will takeaway learnings, swag, a free voucher to visit the museum, and a chance to win the latest ipad Pro! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Lewis Hemens about DataForm, a platform that helps analy

Summary The process of exposing your data through a SQL interface has many possible pathways, each with their own complications and tradeoffs. One of the recent options is Rockset, a serverless platform for fast SQL analytics on semi-structured and structured data. In this episode CEO Venkat Venkataramani and SVP of Product Shruti Bhat explain the origins of Rockset, how it is architected to allow for fast and flexible SQL analytics on your data, and how their serverless platform can save you the time and effort of implementing portions of your own infrastructure.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Shruti Bhat and Venkat Venkataramani about Rockset, a serverless platform for enabling fast SQL queries across all of your data

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Rockset is and your motivation for creating it?

What are some of the use cases that it enables which would otherwise be impractical or intractable?

How does Rockset fit into the infrastructure and workflow of data teams and what portions of a typical stack does it replace? Can you describe how the Rockset platform is architected and how it has evolved as you onboard more customers? Can you describe the flow of a piece of data as it traverses the full lifecycle in Rockset? How is your storage backend implemented to allow for speed and flexibility in the query layer?

How does it manage distribution, balancing, and durability of the data? What are your strategies for handling node and region failure in the cloud?

You have a whitepaper describing your ar

Summary Object storage is quickly becoming the unifying layer for data intensive applications and analytics. Modern, cloud oriented data warehouses and data lakes both rely on the durability and ease of use that it provides. S3 from Amazon has quickly become the de-facto API for interacting with this service, so the team at MinIO have built a production grade, easy to manage storage engine that replicates that interface. In this episode Anand Babu Periasamy shares the origin story for the MinIO platform, the myriad use cases that it supports, and the challenges that they have faced in replicating the functionality of S3. He also explains the technical implementation, innovative design, and broad vision for the project.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Anand Babu Periasamy about MinIO, the neutral, open source, enterprise grade object storage system.

Interview

Introduction How did you get involved in the area of data management? Can you explain what MinIO is and its origin story? What are some of the main use cases that MinIO enables? How does MinIO compare to other object storage options and what benefits does it provide over other open source platforms?

Your marketing focuses on the utility of MinIO for ML and AI workloads. What benefits does object storage provide as compared to distributed file systems? (e.g. HDFS, GlusterFS, Ceph)

What are some of the challenges that you face in terms of maintaining compatibility with the S3 interface?

What are the constraints and opportunities that are provided by adhering to that API?

Can you describe how MinIO is implemented and the overall system design?

How has that design evolved since you first began working on it?

What assumptions did you have at the outset and how have they been challenged or updated?

What are the axes for scaling that MinIO provides and how does it handle clustering?

Where does it fall on the axes of availability and consistency in the CAP theorem?

One of the useful features that you provide is efficient erasure coding, as well as protection against data corruption. How much overhead do those capabilties incur, in terms of computational efficiency and, in a clustered scenario, storage volume? For someone who is interested in running MinIO, what is involved in deploying and maintain

Summary The conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for analysis and interpretation. Unfortunately this strategy is not viable for handling real-time, real-world use cases such as traffic management or supply chain logistics. In this episode Simon Crosby, CTO of Swim Inc., explains how the SwimOS kernel and the enterprise data fabric built on top of it enable brand new use cases for instant insights. This was an eye opening conversation about how stateful computation of data streams from edge devices can reduce cost and complexity as compared to batch oriented workflows.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Simon Crosby about Swim.ai, a data fabric for the distributed enterprise

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Swim.ai is and how the project and business got started?

Can you explain the differentiating factors between the SwimOS and Data Fabric platforms that you offer?

What are some of the use cases that are enabled by the Swim platform that would otherwise be impractical or intractable? How does Swim help alleviate the challenges of working with sensor oriented applications or edge computing platforms? Can you describe a typical design for an application or system being built on top of the Swim platform?

What does the developer workflow look like?

What kind of tooling do you have for diagnosing and debugging errors in an application built on top of Swim?

Can you describe the internal design for the SwimOS and ho

Summary Data professionals are working in a domain that is rapidly evolving. In order to stay current we need access to deeply technical presentations that aren’t burdened by extraneous marketing. To fulfill that need Pete Soderling and his team have been running the Data Council series of conferences and meetups around the world. In this episode Pete discusses his motivation for starting these events, how they serve to bring the data community together, and the observations that he has made about the direction that we are moving. He also shares his experiences as an investor in developer oriented startups and his views on the importance of empowering engineers to launch their own companies.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Pete Soderling about his work to build and grow a community for data professionals with the Data Council conferences and meetups, as well as his experiences as an investor in data oriented companies

Interview

Introduction How did you get involved in the area of data management? What was your original reason for focusing your efforts on fostering a community of data engineers?

What was the state of recognition in the industry for that role at the time that you began your efforts?

The current manifestation of your community efforts is in the form of the Data Council conferences and meetups. Previously they were known as Data Eng Conf and before that was Hakka Labs. Can you discuss the evolution of your efforts to grow this community?

How has the community itself changed and grown over the past few years?

Communities form around a huge variety of focal points. What are some of the complexities or challenges in building one based on something as nebulous as data? Where do you draw inspiration and direction for how to manage such a large and distributed community?

What are some of the most interesting/challenging/unexpected aspects of community management that you have encountered?

What are some ways that you have been surprised or delighted in your interactions with the data community? How do you approach sustainability of the Data Council community and the organization itself? The tagline that you have focused on for Data Council events is that they are no fluff, juxtaposing them against larger business oriented events. What are your guidelines for fulfilling that promise and why do you think that is an important distinction? In addition to your community building you are also an investor. How did you get involved in that side of your business and how does it fit into your overall mission? You also have a stated mission to help engineers build their own companies. In your opinion, how does an engineer led business differ from one that may be founded or run by a business oriented individual and why do you think that we need more of them?

What are the ways that you typically work to empower engineering founders or encourage them to create their own businesses?

What are some of the challenges that engineering founders face and what are some common difficulties or misunderstandings related to business?

What are your opinions on venture-backed vs. "lifestyle" or bootstrapped businesses?

What are the characteristics of a data business that you look at when evaluating a potential investment? What are some of the current industry trends that you are most excited by?

What are some that you find concerning?

What are your goals and plans for the future of Data Council?

Contact Info

@petesoder on Twitter LinkedIn @petesoder on Medium

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Data Council Database Design For Mere Mortals Bloomberg Garmin 500 Startups Geeks On A Plane Data Council NYC 2019 Track Summary Pete’s Angel List Syndicate DataOps

Data Kitchen Episode DataOps Vs DevOps Episode

Great Expectations

Podcast.init Interview

Elementl Dagster

Data Council Presentation

Data Council Call For Proposals

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Data engineers are responsible for building tools and platforms to power the workflows of other members of the business. Each group of users has their own set of requirements for the way that they access and interact with those platforms depending on the insights they are trying to gather. Benn Stancil is the chief analyst at Mode Analytics and in this episode he explains the set of considerations and requirements that data analysts need in their tools and. He also explains useful patterns for collaboration between data engineers and data analysts, and what they can learn from each other.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Counsil. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Benn Stancil, chief analyst at Mode Analytics, about what data engineers need to know when building tools for analysts

Interview

Introduction How did you get involved in the area of data management? Can you start by describing some of the main features that you are looking for in the tools that you use? What are some of the common shortcomings that you have found in out-of-the-box tools that organizations use to build their data stack? What should data engineers be considering as they design and implement the foundational data platforms that higher order systems are built on, which are ultimately used by analysts and data scientists?

In terms of mindset, what are the ways that data engineers and analysts can align and where are the points of conflict?

In terms of team and organizational structure, what have you found to be useful patterns for reducing friction in the product lifecycle for data tools (internal or external)? What are some anti-patterns that data engineers can guard against as they are designing their pipelines? In your experience as an analyst, what have been the characteristics of the most seamless projects that you have been involved with? How much understanding of analytics are necessary for data engineers to be successful in their projects and careers?

Conversely, how much understanding of data management should analysts have?

What are the industry trends that you are most excited by as an analyst?

Contact Info

LinkedIn @bennstancil on Twitter Website

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for

Summary Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise grade analytics it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative business, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Counsil. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Flavio Villanustre about the HPCC Systems project and his work at LexisNexis Risk Solutions

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what the HPCC system is and the problems that you were facing at LexisNexis Risk Solutions which led to its creation?

What was the overall state of the data landscape at the time and what was the motivation for releasing it as open source?

Can you describe the high level architecture of the HPCC Systems platform and some of the ways that the design has changed over the years that it has been maintained? Given how long the project has been in use, c

Summary Data is only valuable if you use it for something, and the first step is knowing that it is available. As organizations grow and data sources proliferate it becomes difficult to keep track of everything, particularly for analysts and data scientists who are not involved with the collection and management of that information. Lyft has build the Amundsen platform to address the problem of data discovery and in this episode Tao Feng and Mark Grover explain how it works, why they built it, and how it has impacted the workflow of data professionals in their organization. If you are struggling to realize the value of your information because you don’t know what you have or where it is then give this a listen and then try out Amundsen for yourself.

Announcements

Welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Finding the data that you need is tricky, and Amundsen will help you solve that problem. And as your data grows in volume and complexity, there are foundational principles that you can follow to keep data workflows streamlined. Mode – the advanced analytics platform that Lyft trusts – has compiled 3 reasons to rethink data discovery. Read them at dataengineeringpodcast.com/mode-lyft. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, the Open Data Science Conference, and Corinium Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Mark Grover and Tao Feng about Amundsen, the data discovery platform and metadata engine that powers self service data access at Lyft

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Amundsen is and the problems that it was designed to address?

What was lacking in the existing projects at the time that led you to building a new platform from the ground up?

How does Amundsen fit in the larger ecosystem of data tools?

How does it compare to what WeWork is building with Marquez?

Can you describe the overall architecture of Amundsen and how it has evolved since you began working on it?

What were the main assumptions that you had going into this project and how have they been challenged or updated in the process of building and using it?

What has been the impact of Amundsen on the workflows

Summary Successful machine learning and artificial intelligence projects require large volumes of data that is properly labelled. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box. In this episode Mark Sears, CEO of CloudFactory, explains how he and his team built a platform that provides valuable service to businesses and meaningful work to developing nations. He shares the lessons learned in the early years of growing the business, the strategies that have allowed them to scale and train their workforce, and the benefits of working within their customer’s existing platforms. He also shares some valuable insights into the current state of the art for machine learning in the real world.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Mark Sears about Cloud Factory, masters of the art and science of labeling data for Machine Learning and more

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what CloudFactory is and the story behind it? What are some of the common requirements