talk-data.com

Topic: Data Science

Tags: machine_learning, statistics, analytics

1516 tagged activities

Activity Trend

Peak of 68 activities per quarter (2020-Q1 to 2026-Q1)

Activities

1516 activities · Newest first

The English SDK for Apache Spark™

In the fast-paced world of data science and AI, we will explore how large language models (LLMs) can elevate the development process of Apache Spark applications.

We'll demonstrate how LLMs can simplify SQL query creation, data ingestion, and DataFrame transformations, leading to faster development and clearer code that's easier to review and understand. We'll also show how LLMs can assist in creating visualizations and clarifying data insights, making complex data easy to understand.

Furthermore, we'll discuss how LLMs can be used to create user-defined data sources and functions, offering a higher level of adaptability in Apache Spark applications.
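The flow the abstract describes, natural language in, executable SQL out, can be sketched with a stubbed LLM call and an in-memory SQLite table. All names here are hypothetical illustrations, not the English SDK's actual API:

```python
import sqlite3

def llm_to_sql(instruction: str) -> str:
    # Stand-in for a real LLM call; a canned response for illustration only.
    canned = {
        "total revenue per region, highest first":
            "SELECT region, SUM(revenue) AS total "
            "FROM sales GROUP BY region ORDER BY total DESC"
    }
    return canned[instruction]

# A tiny sample table to run the generated SQL against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 100.0), ("US", 250.0), ("EU", 50.0)])

sql = llm_to_sql("total revenue per region, highest first")
rows = conn.execute(sql).fetchall()
print(rows)  # [('US', 250.0), ('EU', 150.0)]
```

The point of the pattern: the generated SQL is plain text, so it can be reviewed by a human before execution, which is part of the "clearer code that's easier to review" claim above.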

Our session, filled with practical examples, highlights the innovative role of LLMs in the realm of Apache Spark development. We invite you to join us in this exploration of how these advanced language models can drive innovation and boost efficiency in the sphere of data science and AI.

Talk by: Gengliang Wang and Allison Wang

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Data Analytic Literacy

The explosive growth in volume and varieties of data generated by the seemingly endless arrays of digital systems and applications is rapidly elevating the importance of being able to utilize data; in fact, data analytic literacy is becoming as important now, at the onset of the Digital Era, as rudimentary literacy and numeracy were throughout the Industrial Era. And yet, what constitutes data analytic literacy is poorly understood. To some, data analytic literacy is the ability to use basic statistics, to others it is data science ‘light’, and to still others it is just general familiarity with common data analytic outcomes.

Exploring the scope and the structure of rudimentary data analytic competencies is at the core of this book, which takes the perspective that data analytics is a new and distinct domain of knowledge and practice. It offers application-minded framing of rudimentary data analytic competencies built around conceptually sound and practically meaningful processes and mechanics of systematically transforming messy and heterogeneous data into informative insights.

Data Analytic Literacy is meant to offer an easy-to-follow overview of the critical elements of the reasoning behind basic data manipulation and analysis approaches and steps, coupled with the commonly used data analytic and data communication techniques and tools. It offers an all-inclusive guide to developing basic data analytic competencies.


A presentation of the beta version of Code Interpreter and a demonstration of its capabilities, including: mathematical computation (algebra, trigonometry, statistics), data manipulation and analysis, data visualization, Python script execution, training and evaluation of machine learning models, and text and natural-language processing (tokenization, stemming, word frequencies, etc.). Note that the tool is restricted by safety rules (no Internet access, no calls to external APIs, and no downloading files from the web).
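The text-processing tasks listed above (tokenization, word frequency) are the kind of thing Code Interpreter executes as ordinary Python; a minimal standard-library sketch:

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Lowercase word tokenization with a simple regex --
    # roughly what a quick word-frequency analysis needs.
    return re.findall(r"[a-z']+", text.lower())

text = "The cat sat on the mat. The mat was flat."
tokens = tokenize(text)
freq = Counter(tokens)
print(freq.most_common(2))  # [('the', 3), ('mat', 2)]
```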

In this episode, Marijn Markus & I talked about what it is ACTUALLY like being in the data field, how he transitioned from social science to data science, & discussed some real-world data use cases.

Marijn also shared his experience managing data teams, what makes a good junior hire, why you might not need machine learning, & much, much more.

You don’t want to miss this episode!

📊 Come to my next free “How to Land Your First Data Job” training

🏫 Check out my 10-week data analytics bootcamp

Connect with Marijn: https://www.linkedin.com/in/marijnmarkus/

Timestamps:

(4:40) - Diverse backgrounds matter for programming and stats!
(17:42) - Model explanations matter for business
(21:00) - The real reason why generative AI became big in December
(25:24) - Focus on what you know, not what you don't know
(30:14) - You can't control the interview session? Think again!

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://DataCareerJumpstart.com/daa

Brazilian scientist Jonatas Grosman reached the top of the ranking of most-downloaded models on #HuggingFace, surpassing even #Google's BERT. Jonatas's model is a fine-tuning of Facebook's Wav2Vec2-XLSR-53 model, and performs speech recognition in English.

Answering a request from the Data Hackers community, the largest AI and Data Science community in Brazil, in a very fun conversation, meet Jonatas Grosman, PhD in Computer Science and researcher at PUC-Rio. In this episode he tells his story, explains how the model was built, and raises the challenge of building connections between universities, research, and industry.

Remember that you can find all the podcasts in the Data Hackers family on Spotify, iTunes, Google Podcasts, Castbox, and many other platforms. If you prefer, you can also listen to the episode right here in the post!

Medium link: https://medium.com/data-hackers/o-brasileiro-com-a-ia-mais-baixada-do-mundo-data-hackers-podcast-70-e13a8c66fbcd

Heads up: Jonatas asked us to let the Data Hackers community know that the Department of Informatics at PUC-Rio (http://www.inf.puc-rio.br) has open applications for its master's and doctoral programs through July 16. 😉

Meet our guests:

Jonatas Grosman — PhD in Computer Science and Researcher at PUC-Rio (https://www.linkedin.com/in/jonatasgrosman/)

Data Hackers panel:

Gabriel Lages, Allan Sene, Paulo Vasconcellos, Monique Femme

Mentioned in the episode — reference links:

Join the Data Hackers Challenge'23: https://www.kaggle.com/datasets/datahackers/state-of-data-2022/discussion/415994
Jonatas's Hugging Face: https://huggingface.co/jonatasgrosman
Department of Informatics at PUC-Rio: http://www.inf.puc-rio.br (applications for the master's and doctoral programs are open through July 16 😉)
ExACTa Lab: https://exacta.inf.puc-rio.br (the lab coordinated by Jonatas's advisor, which works to bring industry and academia together)
Jonatas's GitHub: https://github.com/jonatasgrosman
Atomic clock: http://www.cepa.if.usp.br/e-fisica/mecanica/pesquisahoje/cap3/defaultframebaixo.htm
LNCC supercomputer: https://www.gov.br/lncc/pt-br/supercomputador-santos-dumont
LNCC supercomputer video: https://www.youtube.com/watch?v=nN6v0ExmQD4
Janitor disables 'annoying alarm' and US university loses 20 years of scientific research: https://gq.globo.com/noticias/noticia/2023/06/zelador-desliga-alarme-irritante-com-pesquisa-cientifica-e-universidade-perde-20-anos-de-pesquisa.ghtml

Learn Enough Python to Be Dangerous: Software Development, Flask Web Apps, and Beginning Data Science with Python

All You Need to Know, and Nothing You Don't, to Solve Real Problems with Python

Python is one of the most popular programming languages in the world, used for everything from shell scripts to web development to data science. As a result, Python is a great language to learn, but you don't need to learn "everything" to get started, just how to use it efficiently to solve real problems. In Learn Enough Python to Be Dangerous, renowned instructor Michael Hartl teaches the specific concepts, skills, and approaches you need to be professionally productive. Even if you've never programmed before, Hartl helps you quickly build technical sophistication and master the lore you need to succeed. Hartl introduces Python both as a general-purpose language and as a specialist tool for web development and data science, presenting focused examples and exercises that help you internalize what matters, without wasting time on details pros don't care about. Soon, it'll be like you were born knowing this stuff--and you'll be suddenly, seriously dangerous.

Learn enough about . . .

• Applying core Python concepts with the interactive interpreter and command line
• Writing object-oriented code with Python's native objects
• Developing and publishing self-contained Python packages
• Using elegant, powerful functional programming techniques, including Python comprehensions
• Building new objects, and extending them via Test-Driven Development (TDD)
• Leveraging Python's exceptional shell scripting capabilities
• Creating and deploying a full web app, using routes, layouts, templates, and forms
• Getting started with data-science tools for numerical computations, data visualization, data analysis, and machine learning
• Mastering concrete and informal skills every developer needs

Michael Hartl's Learn Enough Series includes books and video courses that focus on the most important parts of each subject, so you don't have to learn everything to get started--you just have to learn enough to be dangerous and solve technical problems yourself. Like this book? Don't miss Michael Hartl's companion video tutorial, Learn Enough Python to Be Dangerous LiveLessons. Register your book for convenient access to downloads, updates, and/or corrections as they become available. See inside book for details.
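As a taste of the comprehension techniques the book's functional-programming material covers, a short illustrative example (not taken from the book):

```python
states = ["North Dakota", "South Dakota", "New York"]

# List comprehension: build URL-friendly slugs in one expression.
slugs = [s.lower().replace(" ", "-") for s in states]

# Dict comprehension: map each state name to its word count.
lengths = {s: len(s.split()) for s in states}

print(slugs)    # ['north-dakota', 'south-dakota', 'new-york']
print(lengths)  # {'North Dakota': 2, 'South Dakota': 2, 'New York': 2}
```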

Maddie Shang - OpenMined (Sr. AI Research Engineer)

Maddie is a Sr. ML / Research Engineer in industry, published author and seasoned open-source AI leader, with 6+ years of experience in ML R&D. Her areas of interest include generative models, NLP and Human <> AI interactions. She was also a 2x startup founder, a Blockchain educator/researcher, Founder of Women Who Code - Data Science, and technical advisor to various startups and Di…

Dive Into Data Science

Dive into the exciting world of data science with this practical introduction. Packed with essential skills and useful examples, Dive Into Data Science will show you how to obtain, analyze, and visualize data so you can leverage its power to solve common business challenges. With only a basic understanding of Python and high school math, you’ll be able to effortlessly work through the book and start implementing data science in your day-to-day work. From improving a bike sharing company to extracting data from websites and creating recommendation systems, you’ll discover how to find and use data-driven solutions to make business decisions. Topics covered include conducting exploratory data analysis, running A/B tests, performing binary classification using logistic regression models, and using machine learning algorithms.

You’ll also learn how to:

• Forecast consumer demand
• Optimize marketing campaigns
• Reduce customer attrition
• Predict website traffic
• Build recommendation systems

With this practical guide at your fingertips, harness the power of programming, mathematical theory, and good old common sense to find data-driven solutions that make a difference. Don’t wait; dive right in!
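The A/B testing topic mentioned above boils down to a two-proportion z-test; a self-contained sketch using only the standard library (the conversion numbers are invented for illustration):

```python
from math import sqrt, erf

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled proportion under the null hypothesis of equal rates.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Variant A: 200/1000 conversions; variant B: 260/1000.
z, p = two_proportion_z(200, 1000, 260, 1000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

With these numbers the difference is statistically significant at the usual 0.05 level, so the test would favor variant B.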

At Coinbase, Airflow is adopted by a wide range of applications and used by nearly all the engineering and data science teams. In this session, we will share our journey in improving the productivity of Airflow users at Coinbase. The presentation will focus on three main topics:

• Monorepo-based architecture: our approach of using a monorepo to simplify DAG development and enable developers from across the company to work more efficiently and collaboratively.
• Tailored testing environment: our tailored Airflow testing environments that cater to users of different profiles, helping them test their code more efficiently and with greater confidence.
• AirAgent: our in-house solution for Airflow continuous deployment, which puts Airflow deployment in self-driving mode and supports deploying any code changes related to Airflow (DAGs, plugins, configurations, dependency changes, etc.) without downtime.

Data science and machine learning are at the heart of Faire’s industry-celebrated marketplace (a16z top-ranked) and drive powerful search, navigation, and risk functions backed by ML models trained on 3000+ features defined by our data scientists. Previously, defining, backfilling, and maintaining the feature lifecycle was error-prone. A framework built on top of Airflow has empowered data scientists to maintain and deploy their changes independently. We will explore:

• How to leverage Airflow as a tool that can power ML training, and extend it with a framework that powers a feature store.
• Enabling data scientists to define new features and backfill them (a common problem in the ML world) using dynamic DAGs.

The talk will provide valuable insights into how Faire constructed a framework that builds datasets to train models, plus how empowering end-users with tools isn’t something to fear but frees up engineering teams to focus on strategic initiatives.
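The dynamic-DAG idea, generating one task per feature definition, can be illustrated in plain Python with no Airflow dependency (the names are hypothetical; in a real deployment each generated task would be an Airflow operator inside a DAG):

```python
from typing import Callable

def make_backfill_task(feature: str) -> Callable[[], str]:
    # Factory function: one task per feature, closing over its name --
    # the same closure pattern Airflow's dynamic DAG generation relies on.
    def task() -> str:
        return f"backfilled:{feature}"
    return task

# Feature definitions supplied by data scientists (illustrative names).
FEATURES = ["order_count_30d", "avg_basket_value", "risk_score"]

# Generate one task per feature, as a dynamic DAG would.
tasks = {f: make_backfill_task(f) for f in FEATURES}

results = [tasks[f]() for f in FEATURES]
print(results)
```

The factory function matters: binding the feature name via a closure (rather than referencing a loop variable) ensures each generated task keeps its own feature.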

High-scale orchestration of genomic algorithms using Airflow workflows, AWS Elastic Container Service (ECS), and Docker. Genomic algorithms are highly demanding of CPU, RAM, and storage, and our data science team requires a platform to facilitate the development and validation of proprietary algorithms. The data engineering team develops a research data platform that enables data scientists to publish Docker images to AWS ECR and run them using Airflow DAGs that provision ECS compute on EC2 and Fargate. We will describe a research platform that allows our data science team to check their algorithms on ~1000 cases in parallel using the Airflow UI and dynamic DAG generation to utilize EC2 machines, auto-scaling groups, and ECS clusters across multiple AWS regions.

The ability to create DAGs programmatically opens up new possibilities for collaboration between data science and data engineering. Engineering and DevOps are typically incentivized by stability, whereas data science is typically incentivized by fast iteration and experimentation. With Airflow, it becomes possible for engineers to create tools that allow data scientists and analysts to create robust no-code/low-code data pipelines for feature stores. We will discuss Airflow as a means of bridging the gap between data infrastructure and modeling iteration, and examine how a Qbiz customer did just this by creating a tool that allows data scientists to build features, train models, and measure performance, using cloud services, in parallel.

Geospatial Data Analytics on AWS

In "Geospatial Data Analytics on AWS," you will learn how to store, manage, and analyze geospatial data effectively using various AWS services. This book provides insight into building geospatial data lakes, leveraging AWS databases, and applying best practices to derive insights from spatial data in the cloud.

What this book will help me do:

• Design and manage geospatial data lakes on AWS, leveraging S3 and other storage solutions.
• Analyze geospatial data using AWS services such as Athena and Redshift.
• Utilize machine learning models for geospatial data processing and analytics using SageMaker.
• Visualize geospatial data through services like Amazon QuickSight and OpenStreetMap integration.
• Avoid common pitfalls when managing geospatial data in the cloud.

Author(s): Scott Bateman, Janahan Gnanachandran, and Jeff DeMuth bring their extensive experience in cloud computing and geospatial analytics to this book. With backgrounds in cloud architecture, data science, and geospatial applications, they aim to make complex topics accessible. Their collaborative approach ensures readers can practically apply concepts to real-world challenges.

Who is it for? This book is ideal for GIS and data professionals, including developers, analysts, and scientists. It suits readers with a basic understanding of geographical concepts but no prior AWS experience. If you're aiming to enhance your cloud-based geospatial data management and analytics skills, this is the guide for you.

The past decade has seen rapid development of Artificial Intelligence (AI) and Machine Learning (ML) across different industries and for a multitude of successful use cases. However, one key challenge many businesses face for larger-scale adoption of AI and ML is that their data is often not ready for AI/ML. Automated feature engineering is a technology that aims to address the fundamental challenges of data readiness for AI. In this talk, we will review automated feature engineering technology and discuss how data scientists can benefit from this technology to transform their data and enable AI applications.
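As a toy illustration of what automated feature engineering does, here is a sketch that mechanically derives pairwise ratio features from numeric columns (a hypothetical helper, not any specific product's API):

```python
from itertools import combinations

def generate_ratio_features(rows: list[dict], numeric_cols: list[str]) -> list[dict]:
    # Naive automated feature engineering: derive a ratio feature for
    # every pair of numeric columns, skipping zero denominators.
    out = []
    for row in rows:
        feats = dict(row)
        for a, b in combinations(numeric_cols, 2):
            if row[b] != 0:
                feats[f"{a}_per_{b}"] = row[a] / row[b]
        out.append(feats)
    return out

data = [{"revenue": 100.0, "orders": 4, "visits": 50}]
enriched = generate_ratio_features(data, ["revenue", "orders", "visits"])
print(enriched[0]["revenue_per_orders"])  # 25.0
```

Real automated feature engineering systems go much further (aggregations across related tables, time windows, feature selection), but the core move is the same: enumerating candidate transformations so the data scientist does not have to hand-write each one.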

In this episode, Avery took a unique approach by going live on LinkedIn and letting the audience ask any question they wanted.

Join us as we dive deep into Avery's insights and practical tips to help you navigate the world of data.

🏫 Check out my 10-week data analytics bootcamp

📊 Come to my next free “How to Land Your First Data Job” training

Timestamps:

(3:09) - What should I have on my portfolio?

(05:24) - How to build a project?

(06:11) - How to stay organized?

(07:08) - Starting w/ Data Science Projects

(16:00) - How to become a better storyteller?

(21:18) - What to do if you're not landing any interviews?

Connect with Avery:

📺 Subscribe on YouTube

🎙Listen to My Podcast

👔 Connect with me on LinkedIn

📸 Instagram

🎵 TikTok

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://DataCareerJumpstart.com/daa

Data for All

Do you know what happens to your personal data when you are browsing, buying, or using apps? Discover how your data is harvested and exploited, and what you can do to access, delete, and monetize it. Data for All empowers everyone—from tech experts to the general public—to control how third parties use personal data. Read this eye-opening book to learn:

• The types of data you generate with every action, every day
• Where your data is stored, who controls it, and how much money they make from it
• How you can manage access and monetization of your own data
• Restricting data access to only companies and organizations you want to support
• The history of how we think about data, and why that is changing
• The new data ecosystem being built right now for your benefit

The data you generate every day is the lifeblood of many large companies—and they make billions of dollars using it. In Data for All, bestselling author John K. Thompson outlines how this one-sided data economy is about to undergo a dramatic change. Thompson pulls back the curtain to reveal the true nature of data ownership, and how you can turn your data from a revenue stream for companies into a financial asset for your benefit.

About the Technology: New global laws are turning the tide on companies who make billions from your clicks, searches, and likes. This eye-opening book provides an inspiring vision of how you can take back control of the data you generate every day.

About the Book: Data for All gives you a step-by-step plan to transform your relationship with data and start earning a “data dividend”—hundreds or thousands of dollars paid out simply for your online activities. You’ll learn how to oversee who accesses your data, how much different types of data are worth, and how to keep private details private.

About the Reader: For anyone who is curious or concerned about how their data is used. No technical knowledge required.

About the Author: John K. Thompson is an international technology executive with over 37 years of experience in the fields of data, advanced analytics, and artificial intelligence.

Quotes:

“An honest, direct, pull-no-punches source on one of the most important personal issues of our time.... I changed some of my own behaviors after reading the book, and I suggest you do so as well. You have more to lose than you may think.” - From the Foreword by Thomas H. Davenport, author of Competing on Analytics and The AI Advantage

“A must-read for anyone interested in the future of data. It helped me understand the reasons behind the current data ecosystem and the laws that are shaping its future. A great resource for both professionals and individuals. I highly recommend it.” - Ravit Jain, Founder & Host of The Ravit Show, Data Science Evangelist

Today I’m continuing my conversation with Nadiem von Heydebrand, CEO of Mindfuel. In the conclusion of this special 2-part episode, Nadiem and I discuss the role of a Data Product Manager in depth. Nadiem reveals which fields data product managers are currently coming from, and how a new data product manager with a non-technical background can set themselves up for success in this new role. He also walks through his portfolio approach to data product management, and how to prioritize use cases when taking on a data product management role. Toward the end, Nadiem also shares personal examples of how he’s employed these strategies, why he feels it’s so important for engineers to be able to see and understand the impact of their work, and best practices around developing a data product team. 

Highlights / Skip to:

Brian introduces Nadiem and gives context for why the conversation with Nadiem led to a two-part episode (00:35)
Nadiem summarizes his thoughts on data product management and adds context on which fields he sees data product managers currently coming from (01:46)
Nadiem’s take on whether job listings for data product manager roles still have too many technical requirements (04:27)
Why some non-technical people fail when they transition to a data product manager role and the ways Nadiem feels they can bolster their chances of success (07:09)
Brian and Nadiem talk about their views on functional data product team models and the process for developing a data product as a team (10:11)
When Nadiem feels it makes sense to hire a data product manager and adopt a portfolio view of your data products (16:22)
Nadiem’s view on how to prioritize projects as a new data product manager (19:48)
Nadiem shares a story of when he took on an interim role as a head of data and how he employed the portfolio strategies he recommends (24:54)
How Nadiem evaluates perceived usability of a data product when picking use cases (27:28)
Nadiem explains why understanding go-to-market strategy is so critical as a data product manager (30:00)
Brian and Nadiem discuss the importance of today’s engineering teams understanding the value and impact of their work (32:09)
How Nadiem and his team came up with the idea to develop a SaaS product for data product managers (34:40)

Quotes from Today’s Episode “So, data product management [...] is a combination of different capabilities [...]  [including] product management, design, data science, and machine learning. We covered this in viability, desirability, feasibility, and datability. So, these are four dimensions [that] you combine [...] together to become a data product manager.” — Nadiem von Heydebrand (02:34)

“There is no education for data product management today, there’s no university degree. ... So, there’s nobody out there—from my perspective—who really has all the four dimensions from day one. It’s more like an evolution: you’re coming from one of the [parallel business] domains or from one of the [parallel business] fields and then you extend your skill set over time.” — Nadiem von Heydebrand (03:04)

“If a product manager has very good communication skills and is able to break down the needs in a proper way or in a good understandable way to its tech lead, or its engineering lead or data science lead, then I think it works out super well. If this bridge is missing, then it becomes a little bit tricky because then the distance between the product manager and the development team is too far.” – Nadiem von Heydebrand (09:10)

“I think every data leader out there has an Excel spreadsheet or a list of prioritized use cases or the most relevant use cases for the business strategy… You can think about this list as a portfolio. You know, some of these use cases are super valuable; some of these use cases maybe will not work out, and you have to identify those which are bringing real return on investment when you put effort in there.” – Nadiem von Heydebrand (19:01)

“I’m not a magician for data product management. I just focused on a very strategic view on my portfolio and tried to identify those cases and those data products where I can believe I can easily develop them, I have a high degree of adoption with my lines of business, and I can truly measure the added revenue and the impact.” – Nadiem von Heydebrand (26:31)

“As a true data product manager, from my point of view, you are someone who is empathetic for the lines of businesses, to understand what their underlying needs and what the problems are. At the same time, you are a business person. You try to optimize the portfolio for your own needs, because you have business goals coming from your leadership team, from your head of data, or even from the person above, the CTO, CIO, even CEO. So, you want to make sure that your value contribution is always transparent, and visible, measurable, tangible.” – Nadiem von Heydebrand (29:20)

“If we look into classical product management, I mean, the product manager has to understand how to market and how to go to the market. And it’s exactly the same situation with data product managers within your organization. You are as successful as your product performs in the market. This is how you measure yourself as a data product manager. This is how you define success for yourself.” – Nadiem von Heydebrand (30:58)

Links Mindfuel: https://mindfuel.ai/ LinkedIn: https://www.linkedin.com/in/nadiemvh/ Delight Software - the SAAS tool for data product managers to manage their portfolio of data products: https://delight.mindfuel.ai

Technological promises have been made in the field of education that were never kept. Meanwhile, advances in Artificial Intelligence have had significant impacts on learning.

To address this topic, we at Data Hackers — the largest AI and Data Science community in Brazil — brought together leading experts to talk in this episode about the impacts of AI on education, where the great challenge for Artificial Intelligence in this area is to create more human moments.

For this conversation we invited Guilherme Silveira, Chief Innovation Officer and co-founder of Alura, and Jones Madruga, Senior Engineering Manager at Nubank and AI Teacher Fellowship at Sirius Education, who share their views on the advances of Artificial Intelligence in the field of education.

Remember that you can find all the podcasts in the Data Hackers family on Spotify, iTunes, Google Podcasts, Castbox, and many other platforms. If you prefer, you can also listen to the episode right here in the post!

Mentioned in the episode

Meet our guests:

Guilherme Silveira — Chief Innovation Officer and Co-founder of Alura: https://www.linkedin.com/in/guilhermeazevedosilveira/
Jones Madruga — Senior Engineering Manager at Nubank and AI Teacher Fellowship at Sirius Education: https://www.linkedin.com/in/jonesmadruga/

Reference link on Medium: https://medium.com/data-hackers/chatgpt-na-educa%C3%A7%C3%A3o-data-hackers-podcast-69-7d08a473d769