talk-data.com

Topic

Data Science

machine_learning statistics analytics


Activity Trend

Peak 68 activities/quarter, 2020-Q1 to 2026-Q1

Activities

1516 activities · Newest first

Vin and I chat about the challenges of writing books, how companies can mature with data science, why data scientists need to learn strategy, and much more.

(We experienced a slight internet delay around the 15:30 mark, otherwise great)

LinkedIn: https://www.linkedin.com/in/vineetvashishta/

Book: https://www.amazon.com/Data-Profit-Businesses-Leverage-Bottom/dp/1394196210

Site: https://www.datascience.vin/


If you like this show, give it a 5-star rating on your favorite podcast platform.

Purchase Fundamentals of Data Engineering at your favorite bookseller.

Subscribe to my Substack: https://joereis.substack.com/

There are still many discussions around AI regulation in Brazil aimed at a positive impact: promoting ethics and transparency in the use of the technology, driving innovation, and protecting citizens' rights.

And to address this subject in this episode of Data Hackers — the largest AI and Data Science community in Brazil — meet one of the leading references on the topic: Diogo Cortiz, professor at PUC-SP, cognitive scientist, futurist, speaker, and content creator, who discusses the current state of AI regulation in Brazil.

Remember that you can find all the Data Hackers community podcasts on Spotify, iTunes, Google Podcasts, Castbox, and many other platforms. If you prefer, you can also listen to the episode right here in the post!

Discussed in the episode

Meet our guest:

Diogo Cortiz: https://www.linkedin.com/in/diogocortiz/

The Data Hackers panel:

Paulo Vasconcellos, Gabriel Lages, Monique Femme

Reference links:

Join the Data Hackers Challenge'23: https://www.kaggle.com/datasets/datahackers/state-of-data-2022/discussion/415994
Forbes article: https://forbes.com.br/forbes-tech/2023/06/por-que-as-empresas-estao-despreparadas-para-os-riscos-da-ia/
Palestinian man arrested after Facebook mistranslates his "good morning": https://gq.globo.com/Prazeres/Tecnologia/noticia/2017/10/palestino-e-preso-apos-o-facebook-traduzir-errado-o-seu-bom-dia.html
Projeto de Lei 38/2023: https://www.camara.leg.br/proposicoesWeb/fichadetramitacao?idProposicao=2345695

Summary

All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, so an eventual migration is inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons they learned so that you don't have to.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

Modern data teams are using Hex to 10x their data impact. Hex combines a notebook-style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format with the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain, and document your code. The best data teams in the world, such as the ones at Notion, AngelList, and Anthropic, use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial for your team!

Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy and Rob Goretsky about when and how to think about migrating your data stack.

Interview

Introduction
How did you get involved in the area of data management?
A migration can be anything from a minor task to a major undertaking. Can you start by describing what constitutes a migration for the purposes of this conversation?
Is it possible to completely avoid having to invest in a migration?
What are the signals that point to the need for a migration?

What are some of the sources of cost that need to be accounted for when considering a migration? (both the costs of doing one and the costs of not doing one)
What are some signals that a migration is not the right solution for a perceived problem?

Once the decision has been made that a migration is necessary, what are the questions that the team should be asking to determine the technologies to move to and the sequencing of execution?
What are the preceding tasks that should be completed before starting the migration to ensure there is no breakage downstream of the changing component(s)?
What are some of the ways that a migration effort might fail?
What are the major pitfalls that teams need to be aware of as they work through a data platform migration?
What are the opportunities for automation during the migration process?
What are the most interesting, innovative, or unexpected ways that you have seen teams approach a platform migration?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform migrations?
What are some ways that the technologies and patterns that we use can be evolved to reduce the cost/impact/need for migrations?
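One of the automation opportunities raised above — verifying parity between the old and new systems — comes down to a keyed comparison of rows. Tools like data-diff do this at warehouse scale; the sketch below shows only the core idea, with hypothetical table contents and a hypothetical `diff_tables` helper:

```python
# Sketch: row-level parity check between a legacy table and its migrated copy.
# Table names, key field, and row contents are hypothetical illustration data.

def diff_tables(old_rows, new_rows, key="id"):
    """Return keys missing from the new table, unexpected keys, and changed rows."""
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    missing = sorted(old.keys() - new.keys())
    unexpected = sorted(new.keys() - old.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"missing": missing, "unexpected": unexpected, "changed": changed}

legacy = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}, {"id": 3, "amount": 30.0}]
migrated = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 21.5}]  # id 3 dropped, id 2 drifted

report = diff_tables(legacy, migrated)
print(report)  # {'missing': [3], 'unexpected': [], 'changed': [2]}
```

Running a check like this against every migrated table before cutover is what keeps downstream consumers from silently receiving different numbers.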

Contact Info

Gleb

LinkedIn @glebmm on Twitter

Rob

LinkedIn RobGoretsky on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

Datafold

Podcast Episode

Informatica Airflow Snowflake

Podcast Episode

Redshift Eventbrite Teradata BigQuery Trino EMR (Elastic MapReduce) Shadow IT

Podcast Episode

Mode Analytics Looker Sunk Cost Fallacy data-diff

Podcast Episode

SQLGlot Dagster dbt

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored by: Hex

Hex is a collaborative workspace for data science and analytics. A single place for teams to explore, transform, and visualize data into beautiful interactive reports. Use SQL, Python, R, no-code, and AI to find and share insights across your organization. Empower everyone in an organization to make an impact with data. Sign up today at [dataengineeringpodcast.com/hex](https://www.dataengineeringpodcast.com/hex) and get 30 days free!

RudderStack:

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack

Support Data Engineering Podcast

Building Data Science Applications with FastAPI - Second Edition

Building Data Science Applications with FastAPI is your comprehensive guide to mastering the FastAPI framework to build efficient, reliable data science applications and APIs. You'll explore examples and projects that integrate machine learning models, manage databases, and leverage advanced FastAPI features like asynchronous I/O and WebSockets.

What this book will help you do:
Develop an understanding of the fundamentals and advanced features of the FastAPI framework, like dependency injection and type hinting.
Learn how to integrate machine learning models into a FastAPI-based web backend effectively.
Master concepts of authentication, database connections, and asynchronous programming in Python.
Build and deploy two practical AI applications: a real-time object detection tool and a text-to-image generator.
Acquire skills to monitor, log, and maintain software systems for optimal performance and reliability.

Author(s): François Voron is an experienced Python developer and data scientist with extensive knowledge of web frameworks including FastAPI. With years of experience designing and deploying machine learning and data science applications, François focuses on empowering developers with practical techniques and real-world applications. His guidance helps readers tackle contemporary challenges in software development.

Who is it for? This book is ideal for data scientists and software engineers looking to broaden their skillset by creating robust web APIs for data science applications. Readers are expected to have a working knowledge of Python and basic data science concepts, offering them a chance to expand into backend development. If you're keen to deploy machine learning models and integrate them seamlessly with web technologies, this book is for you. It provides both fundamental insights and advanced techniques to serve a broad range of learners.

Learnings From the Field: Migration From Oracle DW and IBM DataStage to Databricks on AWS

Legacy data warehouses are costly to maintain, unscalable and cannot deliver on data science, ML and real-time analytics use cases. Migrating from your enterprise data warehouse to Databricks lets you scale as your business needs grow and accelerate innovation by running all your data, analytics and AI workloads on a single unified data platform.

In the first part of this session we will guide you through the well-designed process and tools that will help you from the assessment phase to the actual implementation of an EDW migration project. We will also address ways to convert proprietary PL/SQL code to open-standard Python code, taking advantage of PySpark for ETL workloads and of Databricks SQL for data analytics workloads.

The second part of this session will be based on an EDW migration project at SNCF (French national railways), one of the major enterprise customers of Databricks in France. Databricks partnered with SNCF to migrate its real estate entity from Oracle DW and IBM DataStage to Databricks on AWS. We will walk you through the customer context, the urgency to migrate, the challenges, the target architecture, the nitty-gritty details of implementation, best practices, recommendations, and learnings for executing a successful migration project in a very accelerated time frame.

Talk by: Himanshu Arora and Amine Benhamza

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

JetBlue’s Real-Time AI & ML Digital Twin Journey Using Databricks

JetBlue has embarked on an AI and ML transformation over the past year. Databricks has been instrumental in this transformation due to the ability to integrate streaming pipelines, ML training using MLflow, ML API serving using the ML registry, and more in one cohesive platform. Real-time streams of weather, aircraft sensor, FAA, and JetBlue operations data feed the world's first AI and ML operating system orchestrating a digital twin, known as BlueSky, for efficient and safe operations. JetBlue has over 10 ML products (multiple models each) in production across multiple verticals, including dynamic pricing, customer recommendation engines, supply chain optimization, customer sentiment NLP, and several more.

The core JetBlue data science and analytics team consists of Operations Data Science, Commercial Data Science, AI and ML Engineering, and Business Intelligence. To facilitate rapid growth and a faster go-to-market strategy, the team has built an internal Data Catalog + AutoML + AutoDeploy wrapper called BlueML using Databricks features, empowering data scientists and advanced analysts to train and deploy ML models in fewer than five lines of code.

Talk by: Derrick Olson and Rob Bajra


Comparing Databricks and Snowflake for Machine Learning

Snowflake and Databricks both aim to provide data science toolkits for machine learning workflows, albeit with different approaches and resources. While developing ML models is technically possible on either platform, the Hitachi Solutions Empower team tested which solution is easier, faster, and cheaper to work with in terms of both user experience and business outcomes for our customers. To do this, we designed and conducted a series of experiments with use cases from the TPCx-AI benchmark standard. We developed both single-node and multi-node versions of these experiments, which, in the case of Snowflake, sometimes required us to set up separate compute infrastructure outside the platform. We also built datasets of various sizes (1GB, 10GB, and 100GB) to assess how each platform/node setup handles scale.

Based on our findings, Databricks is, on average, faster, cheaper, and easier to use for developing machine learning models, and we use it exclusively for data science on the Empower platform. Snowflake's reliance on third-party resources for distributed training is a major drawback, and the need to use multiple compute environments to scale up training is complex and, in our view, an unnecessary complication to achieve the best results.

Talk by: Michael Green and Don Scott


Leveraging IoT Data at Scale to Mitigate Global Water Risks Using Apache Spark™ Streaming and Delta

Every year, billions of dollars are lost due to water risks from storms, floods, and droughts. Water data scarcity and excess are issues that risk models cannot overcome, creating a world of uncertainty. Divirod is building a platform of water data by normalizing diverse data sources of varying velocity into one unified data asset. In addition to publicly available third-party datasets, we are rapidly deploying our own IoT sensors. These sensors ingest signals at a rate of about 100,000 messages per hour into preprocessing, signal-processing, analytics, and postprocessing workloads in one Spark Streaming pipeline to enable critical real-time decision-making. By leveraging a streaming architecture, we were able to reduce end-to-end latency from tens of minutes to just a few seconds.
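The multi-stage pipeline described above can be pictured as chained transformations: raw messages pass through preprocessing, signal processing, and analytics in sequence. The stage logic and message fields below are hypothetical, and the production system uses Spark Structured Streaming rather than Python generators — this only illustrates the shape of the flow:

```python
# Toy sketch of a staged streaming pipeline: preprocess -> smooth -> analyze.

def preprocess(messages):
    """Drop malformed readings and normalize units (mm -> meters)."""
    for m in messages:
        if m.get("level_mm") is not None:
            yield {"sensor": m["sensor"], "level_m": m["level_mm"] / 1000.0}

def smooth(readings, window=3):
    """Signal processing: moving average over the last `window` readings."""
    buf = []
    for r in readings:
        buf.append(r["level_m"])
        buf = buf[-window:]
        yield {**r, "level_m": sum(buf) / len(buf)}

def flag_anomalies(readings, threshold=2.0):
    """Analytics: flag water levels above a hypothetical flood threshold."""
    for r in readings:
        yield {**r, "alert": r["level_m"] > threshold}

raw = [
    {"sensor": "gauge-1", "level_mm": 1500},
    {"sensor": "gauge-1", "level_mm": None},  # malformed, dropped
    {"sensor": "gauge-1", "level_mm": 2600},
    {"sensor": "gauge-1", "level_mm": 3100},
]
results = list(flag_anomalies(smooth(preprocess(raw))))
print(results[-1])  # smoothed level and alert flag for the latest reading
```

Because each stage consumes the previous one lazily, records move through the whole chain one at a time — the same property that lets a true streaming engine keep end-to-end latency in seconds.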

We are leveraging Delta Lake to provide a single query interface across multiple tables of this continuously changing data. This enables data science and analytics workloads to always use the most current and comprehensive information available. In addition to the obvious schema transformations, we implement data quality metrics and datum conversions to provide a trustworthy unified dataset.

Talk by: Adam Wilson and Heiko Udluft

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI


Accelerating the Development of Viewership Personas with a Unified Feature Store

With the proliferation of video content and flourishing consumer demand, there is an enormous opportunity for customer-centric video entertainment companies to use data and analytics to understand what their viewers want and deliver more of the content that meets their needs.

At DIRECTV, our Data Science Center of Excellence is constantly looking to push the boundary of innovation in how we can better and more quickly understand the needs of our customers and leverage those actionable insights to deliver business impact. One way in which we do so is through the development of Viewership Personas, using cluster analysis at scale to group our customers by the types of content they enjoy watching. This process is significantly accelerated by a unified feature store, which contains a wide array of features that capture key information on viewing preferences.
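The persona-clustering idea can be illustrated with plain k-means on two hypothetical feature-store columns (weekly hours of sports vs. drama viewing). The real pipeline runs distributed over many more features; the grouping logic is the same:

```python
# Toy k-means sketch with deterministic init (first k points as centroids).
# The viewer data and feature names are hypothetical illustration values.

def kmeans(points, k, iters=20):
    """Assign each point to the nearest of k centroids, then re-fit centroids."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign, centroids

# (sports_hours, drama_hours) per subscriber — two obvious personas.
viewers = [(9, 1), (8, 2), (10, 0), (1, 9), (2, 8), (0, 10)]
labels, centers = kmeans(viewers, k=2)
print(labels)  # first three viewers land in one cluster, last three in the other
```

Each resulting cluster is a candidate persona ("sports-heavy", "drama-heavy"), and the centroid coordinates describe its typical viewing profile.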

This talk will focus on how the DIRECTV data science team uses Databricks to develop a unified feature store, and how we leverage that feature store to accelerate running machine learning algorithms that find meaningful viewership clusters.

Talk by: Malav Shah and Taylor Hosbach


Planning and Executing a Snowflake Data Warehouse Migration to Databricks

Organizations are going through a critical phase of data infrastructure modernization, laying the foundation for the future, and adapting to support growing data and AI needs. Organizations that embraced cloud data warehouses (CDW) such as Snowflake have ended up trying to use a data warehousing tool for ETL pipelines and data science. This created unnecessary complexity and resulted in poor performance since data warehouses are optimized for SQL-based analytics only.

Realizing the limitation and pain with cloud data warehouses, organizations are turning to a lakehouse-first architecture. Though a migration from one cloud platform to another should be relatively easy, the breadth of the Databricks platform provides flexibility and hence requires careful planning and execution. In this session, we present the migration methodology, technical approaches, automation tools, product/feature mapping, a technical demo, and best practices using real-world case studies for migrating data, ELT pipelines, and warehouses from Snowflake to Databricks.

Talk by: Satish Garla and Ramachandran Venkat


Sponsored by: Immuta | Building an End-to-End MLOps Workflow with Automated Data Access Controls

WorldQuant Predictive’s customers rely on our predictions to understand how changing world and market conditions will impact decisions to be made. Speed is critical, and so are accuracy and resilience. To that end, our data team built a modern, automated MLOps data flow using Databricks as a key part of our data science tooling, and integrated with Immuta to provide automated data security and access control.

In this session, we will share details of how we used policy-as-code to support our globally distributed data science team with secure data sharing, testing, validation and other model quality requirements. We will also discuss our data science workflow that uses Databricks-hosted MLflow together with an Immuta-backed custom feature store to maximize speed and quality of model development through automation. Finally, we will discuss how we deploy the models into our customized serverless inference environment, and how that powers our industry solutions.

Talk by: Tyler Ditto


Using Databricks to Develop Stats & Math Models to Forecast the Monkeypox Outbreak in Washington State

In the spring and summer of 2022, monkeypox was detected in the United States and quickly spread throughout the country. To contain and mitigate the spread of the disease in Washington State, the Washington Department of Health data science team used the Databricks platform to develop a modeling pipeline that employed statistical and mathematical techniques to forecast the course of the monkeypox outbreak throughout the state. These models provided actionable information that helped inform decision making and guide the public health response to the outbreak.

We used contact-tracing data, standard line-lists, and published parameters to train a variety of time-series forecasting models, including an ARIMA model, a Poisson regression, and an SEIR compartmental model. We also calculated the daily R-effective rate as an additional output. The compartmental model best fit the reported cases when tested out of sample, but the statistical models were quicker and easier to deploy and helped inform initial decision-making. The R-effective rate was particularly useful throughout the effort.
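The SEIR compartmental model and R-effective calculation mentioned above can be sketched as a discrete-time simulation. All parameter values below (beta, sigma, gamma, population, seed cases) are hypothetical illustration values, not the ones the Washington DOH team used:

```python
# Discrete-time SEIR sketch: Susceptible -> Exposed -> Infectious -> Recovered.

def seir(pop, beta, sigma, gamma, i0, days):
    """Simulate daily S, E, I, R counts for a closed population."""
    s, e, i, r = pop - i0, 0.0, float(i0), 0.0
    history = []
    for _ in range(days):
        new_exposed = beta * s * i / pop  # new infections from S-I contact
        new_infectious = sigma * e        # exposed become infectious (~1/sigma-day latency)
        new_recovered = gamma * i         # infectious recover (~1/gamma-day infectious period)
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
        history.append((s, e, i, r))
    return history

history = seir(pop=100_000, beta=0.3, sigma=0.2, gamma=0.1, i0=10, days=200)
peak_day = max(range(200), key=lambda d: history[d][2])

# R-effective: the basic reproduction number (beta / gamma) scaled by the
# fraction of the population still susceptible on each day.
r_eff = [(0.3 / 0.1) * s / 100_000 for s, _, _, _ in history]
print(peak_day, round(history[peak_day][2]))  # day and size of the infectious peak
```

Note that the epidemic peaks roughly when R-effective crosses 1 — the point at which each case no longer replaces itself — which is why tracking R-effective daily was so useful during the response.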

Overall, these efforts highlighted the importance of rapidly deployable and scalable infectious disease modeling pipelines. Public health data science is still a nascent field, however, so common best practices in other industries are oftentimes novel approaches in public health. Stable, generalizable pipelines are crucial. Using the Databricks platform has allowed us to more quickly scale and iteratively improve our modeling pipelines to include other infectious diseases, such as influenza and RSV. Further development of scalable and standardized approaches to disease forecasting at the state and local level is vital to better informing future public health response efforts.

Talk by: Matthew Doxey


Leveraging Data Science for Game Growth and Matchmaking Optimization

For video games, data science solutions can be applied throughout the player lifecycle, from adtech, LTV forecasting, and in-game economic system monitoring to experimentation. Databricks is used as the data and computation foundation to power these data science solutions, enabling data scientists to easily develop and deploy them for different use cases.

In this session, we will share insights on how Databricks-powered data science solutions drive game growth and improve player experiences using different advanced analytics, modeling, experimentation, and causal inference methods. We will introduce the business use cases, data science techniques, as well as Databricks demos.

Talk by: Zhenyu Zhao and Shuo Chen


Sponsor: Sigma Computing | Using Sigma Input Tables Helps Databricks Data Science & ML Apps

In this talk, we will dive into the powerful analytics combination of Databricks and Sigma for data science and machine learning use cases. Databricks offers a scalable and flexible platform for building and deploying machine learning models at scale, while Sigma enhances this framework with real-time data insights, analysis, and visualization capabilities. Going a step further, we will demonstrate how Sigma input tables can be used to create seamless workflows in Databricks for data science and machine learning. From this workflow, business users can leverage data science and ML models to do ad-hoc analysis and make data-driven decisions.

This talk is perfect for data scientists, data analysts, business users, and anyone interested in harnessing the power of Databricks and Sigma to drive business value. Join us and discover how these two platforms can revolutionize the way you analyze and leverage your data.

Talk by: Mitch Ertle and Greg Owen


How to Create and Manage a High-Performance Analytics Team

Data science and analytics teams are unique. Large and small corporations want to build and manage analytics teams to convert their data and analytic assets into revenue and competitive advantage, but many are failing before they make their first hire. In this session, the audience will learn how to structure, hire, manage and grow an analytics team. Organizational structure, project and program portfolios, neurodiversity, developing talent, and more will be discussed.

Questions and discussion will be encouraged. The audience will leave with a deeper understanding of how to succeed in turning data and analytics into tangible results.

Talk by: John Thompson

Here’s more to explore: State of Data + AI Report: https://dbricks.co/44i2HBp The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI


LLM in Practice: How to Productionize Your LLMs

Ask questions from a panel of data science experts who have deployed LLMs and AI models into production.

Talk by: David Talby, Conor Murphy, Cheng Yin Eng, Sam Raymond, and Colton Peltier


In this episode of the Data Career Podcast, I explore data analytics vs. data science, highlighting key differences and stressing the fluidity between the two. Don't miss it!

📩 Get my weekly email with helpful data career tips 📊 Come to my next free “How to Land Your First Data Job” training 🏫 Check out my 10-week data analytics bootcamp

Timestamps: (00:56) - Predictive Analytics: Foreseeing the Future 🔮📈 (01:43) - Diagnostic Analytics: Unraveling "Why" 🕵️‍♂️🔍 (02:10) - Analytics in Book Sales 📚📊 (03:09) - Data Scientists: Masters of Prediction 🧠🎯 (03:56) - Decoding Data Analytics vs. Data Science 🗝️💻 (05:26) - Gateway to Success: Landing a Data Job 🚪💼

Connect with Avery: 📺 Subscribe on YouTube 🎙Listen to My Podcast 👔 Connect with me on LinkedIn 📸 Instagram 🎵 TikTok

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://www.datacareerjumpstart.com/daa

Build Your Data Lakehouse with a Modern Data Stack on Databricks

Are you looking for an introduction to the Lakehouse and what the related technology is all about? This session is for you. This session explains the value that lakehouses bring to the table using examples of companies that are actually modernizing their data, showing demos throughout. The data lakehouse is the future for modern data teams that want to simplify data workloads, ease collaboration, and maintain the flexibility and openness to stay agile as a company scales.

Come to this session and learn about the full stack, including data engineering, data warehousing in a lakehouse, data streaming, governance, and data science and AI. Learn how you can create modern data solutions of your own.

Talk by: Ari Kaplan and Pearl Ubaru
