talk-data.com

Topic

Big Data

data_processing analytics large_datasets

1217 tagged

Activity Trend

Peak: 28 activities/quarter (2020-Q1 to 2026-Q1)

Activities

1217 activities · Newest first

Leading in Analytics

A step-by-step guide for business leaders who need to manage successful big data projects, Leading in Analytics: The Critical Tasks for Executives to Master in the Age of Big Data takes you through the entire process of guiding an analytics initiative from inception to execution. You'll learn which aspects of the project to pay attention to, the right questions to ask, and how to keep the project team focused on its mission to produce relevant and valuable results. As an executive, you can't control every aspect of the process. But if you focus on high-impact factors that you can control, you can ensure an effective outcome. This book describes those factors and offers practical insight on how to get them right.

Drawn from best-practice research in the field of analytics, the Manageable Tasks described in this book are specific to the goal of implementing big data tools at an enterprise level. A dream team of analytics and business experts have contributed their knowledge to show you how to choose the right business problem to address, put together the right team, gather the right data, select the right tools, and execute your strategic plan to produce an actionable result. Become an analytics-savvy executive with this valuable book:

Ensure the success of analytics initiatives, maximize ROI, and draw value from big data
Learn to define success and failure in analytics and big data projects
Set your organization up for analytics success by identifying problems that have big data solutions
Bring together the people, the tools, and the strategies that are right for the job

By learning to pay attention to critical tasks in every analytics project, non-technical executives and strategic planners can guide their organizations to measurable results.

Demystifying Data Vault with dbt - Coalesce 2023

In this session, Alex Higgs unveils the potential of Data Vault 2.0, an often overlooked but powerful data warehousing method. Discover how it offers scalability, agility, and flexibility to your data solutions.

Key Highlights:
- Explore the origins and essence of Data Vault 2.0
- Learn how Data Vault 2.0 streamlines big data solutions for scalability
- See how it integrates with dbt via AutomateDV for faster time to value
- Understand how AutomateDV simplifies Data Vault 2.0 data warehouses, freeing data teams from intricate SQL

Speaker: Alex Higgs, Senior Consultant Data Engineer, Datavault

Register for Coalesce at https://coalesce.getdbt.com

Delta Lake: Up and Running

With the surge in big data and AI, organizations can rapidly create data products. However, the effectiveness of their analytics and machine learning models depends on the data's quality. Delta Lake's open source format offers a robust lakehouse framework over platforms like Amazon S3, ADLS, and GCS. This practical book shows data engineers, data scientists, and data analysts how to get Delta Lake and its features up and running. The ultimate goal of building data pipelines and applications is to gain insights from data. You'll understand how your storage solution choice determines the robustness and performance of the data pipeline, from raw data to insights. You'll learn how to:

Use modern data management and data engineering techniques
Understand how ACID transactions bring reliability to data lakes at scale
Run streaming and batch jobs against your data lake concurrently
Execute update, delete, and merge commands against your data lake
Use time travel to roll back and examine previous data versions
Build a streaming data quality pipeline following the medallion architecture
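For a flavor of what those commands look like in code, here is a minimal sketch using the delta-spark Python package; the table path, column names, and staging source are invented for illustration and are not from the book.

# Minimal sketch of Delta Lake update/delete/merge and time travel,
# assuming the delta-spark package is installed; paths and columns
# are invented for illustration.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder.appName("delta-demo")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/delta/orders"
tbl = DeltaTable.forPath(spark, path)

# Update and delete run as ACID transactions against the lake.
tbl.update(condition=F.col("status") == "pending",
           set={"status": F.lit("processing")})
tbl.delete(F.col("status") == "cancelled")

# Merge (upsert) a batch of changes into the table.
updates = spark.read.parquet("/tmp/staging/orders")
(tbl.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)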

With Grupo Boticário, we have already explored topics ranging from what it's like to work with data to how they use the Modern Data Stack. Now we want to know how AI is changing the way people work at one of the most admired companies in Latin America, according to the State of Data Brazil survey.

In this episode of Data Hackers, the largest AI and Data Science community in Brazil, meet this team of specialists: Isabella Becker, DPO (Data Protection Officer), and Bruno Gobbet, Senior Data Manager, both working in the data organization at Grupo Boticário.

Remember that you can find all the Data Hackers community podcasts on Spotify, iTunes, Google Podcasts, Castbox, and many other platforms. If you prefer, you can also listen to the episode right here in this post!

Link on Medium: https://medium.com/data-hackers/como-ia-est%C3%A1-mudando-a-forma-do-grupo-botic%C3%A1rio-trabalhar-data-hackers-podcast-74-c45006b64d67

Covered in this episode

Meet our guests:

Isabella Becker, DPO (Data Protection Officer)
Bruno Gobbet, Senior Data Manager

Data Hackers panel:

Paulo Vasconcellos
Monique Femme

Reference links:

GH TECH (Medium): https://medium.com/gbtech
Data Hackers News (weekly news about data, AI, and technology): https://podcasters.spotify.com/pod/show/datahackers/episodes/Data-Hackers-News-1---Amazon-investe-US-4-bi-na-Anthropic--Microsoft-anuncia-Copilot-para-Windows-11--OpenAI-anuncia-DALL-E-3-e29r06f
Netflix series Coded Bias: https://www.netflix.com/br/title/81328723
Book: Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy: https://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815

For the past few years, we've seen the importance of data literacy and why organizations must invest in a data-driven culture, mindset, and skill set. However, as generative AI tools like ChatGPT have risen to prominence in the past year, AI literacy has never been more important. But how do we begin to approach AI literacy? Is it an extension of data literacy, a complement, or a new paradigm altogether? How should you get started on your AI literacy ambitions?

Cindi Howson is the Chief Data Strategy Officer at ThoughtSpot and host of The Data Chief podcast. Cindi is a data analytics, AI, and BI thought leader and an expert with a flair for bridging business needs with technology. As Chief Data Strategy Officer at ThoughtSpot, she advises top clients on data strategy and best practices to become data-driven, speaks internationally on top trends such as AI ethics, and influences ThoughtSpot's product strategy.

Cindi was previously a Gartner Research Vice President, the lead author for the data and analytics maturity model and the analytics and BI Magic Quadrant, and a popular keynote speaker. She introduced new research in data and AI for good, NLP/BI Search, and augmented analytics, bringing both BI bake-offs and innovation panels to Gartner globally. She's frequently quoted in MIT, Harvard Business Review, and Information Week. She is rated a top 12 influencer in big data and analytics by Analytics Insight, Onalytica, Solutions Review, and Humans of Data.

In the episode, Cindi and Adel discuss how generative AI accelerates an organization's data literacy, how leaders can think beyond data literacy and start to think about AI literacy, the importance of responsible use of AI, how to best communicate the value of AI within your organization, what generative AI means for data teams, AI use cases in the data space, the psychological barriers blocking AI adoption, and much more.

Links Mentioned in the Show:
The Data Chief Podcast
ThoughtSpot Sage
BloombergGPT
Radar: Data & AI Literacy
Course: AI Ethics
Course: Generative AI Concepts
Course: Implementing AI Solutions in Business

Complex Spatial Data Science in the Boardroom | Katy Ashwin & Blair Freebairn | KFC UK & Geolytix

Where should we open 50 stores? Simple question, right?

To answer it, Katy Ashwin, Marketing Planning Analyst at KFC UK, and Blair Freebairn, CEO of Geolytix, talk through complex spatial modelling, mobility-data-derived interaction surfaces, spatial ML ensemble models by channel and store format, big data optimization to create opportunity heat surfaces, and much more.

Learn more about site selection: https://carto.com/solutions/site-selection

Last Night an H3 Saved my Life | T.Rains, M. Garrod, S. Devarajappa | BT Group

Hear from Tim Rains, Geospatial Data Scientist; Matthew Garrod, Data Scientist; and Sachin Devarajappa, Big Data Specialist at BT Group, and learn how they leverage the power of H3 to build scalable solutions with faster processing than spatial SQL.
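To make that concrete, here is a minimal sketch of the H3 indexing idea using the h3-py library (v4 API); the coordinates, resolution, and site names are invented, and this is not BT Group's actual code.

# Minimal sketch of H3-based point indexing, assuming the h3-py
# library (v4 API); all data values are invented.
import h3

RESOLUTION = 9  # roughly 0.1 km^2 hexagons

points = [
    ("site_a", 51.5074, -0.1278),
    ("site_b", 51.5080, -0.1281),
    ("site_c", 53.4808, -2.2426),
]

# Index each point once; downstream joins become cheap equality joins
# on the cell ID instead of geometric predicates in spatial SQL.
counts = {}
for name, lat, lng in points:
    cell = h3.latlng_to_cell(lat, lng, RESOLUTION)
    counts[cell] = counts.get(cell, 0) + 1

# Neighborhood queries are lookups too: grid_disk returns the cell
# plus its k rings of neighboring cells.
for cell, n in counts.items():
    print(cell, n, h3.grid_disk(cell, 1))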

To learn more about spatial indexes check out: https://carto.com/solutions/spatial-indexes

Delta-rs, Apache Arrow, Polars, WASM: Is Rust the Future of Analytics?

Rust is a unique language whose traits make it very appealing for data engineering. In this session, we'll walk through the aspects of the language that make it such a good fit for big data processing: how it improves performance, how it provides stronger safety guarantees, and how its compatibility with a wide range of existing tools positions it to become a major building block for the future of analytics.

We will also take a hands-on look through real code examples at a few emerging technologies built on top of Rust that utilize these capabilities, and learn how to apply them to our modern lakehouse architecture.
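For instance, here is a minimal sketch using the Python bindings of Polars, one of the Rust-based engines the talk covers (recent Polars API assumed; the file name and columns are invented):

# Minimal sketch: the Rust query engine behind Polars plans, pushes
# down the filter, and parallelizes execution before reading any data.
import polars as pl

result = (
    pl.scan_csv("events.csv")
    .filter(pl.col("status") == "ok")
    .group_by("user_id")
    .agg(pl.col("latency_ms").mean().alias("avg_latency_ms"))
    .collect()
)
print(result)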

Talk by: Oz Katz

Here’s more to explore: Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq Lakehouse Fundamentals Training: https://dbricks.co/44ancQs

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Managing Data Encryption in Apache Spark™

Sensitive data sets can be encrypted directly by new Apache Spark™ versions (3.2 and higher). Setting several configuration parameters and DataFrame options will trigger the Apache Parquet modular encryption mechanism that protects select columns with column-specific keys. The upcoming Spark 3.4 version will also support uniform encryption, where all DataFrame columns are encrypted with the same key.

Spark data encryption is already leveraged by a number of companies to protect personal or business-confidential data in their production environments. The main integration effort is focused on key access control and on building Spark/Parquet plug-in code that can interact with the company's key management service (KMS).

In this session, we will briefly cover the basics of Spark/Parquet encryption usage, then dive into the details of encryption key management that will help you integrate this Spark data protection mechanism into your deployment. You will learn how to run a HelloWorld encryption sample, and how to extend it into real-world production code integrated with your organization's KMS and access control policies. We will talk about the standard envelope encryption approach to big data protection, the performance-vs-security trade-offs between single and double envelope wrapping, and internal versus external key metadata storage. We will see a demo, and discuss new features such as uniform encryption and two-tier management of encryption keys.
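For orientation, here is a minimal PySpark sketch along the lines of the HelloWorld sample mentioned above, assuming Spark 3.2+ with Parquet modular encryption available; the key material and column names are demo values, and the in-memory KMS is for testing only, never production:

# Minimal sketch of Parquet modular encryption in Spark 3.2+; the
# mock in-memory KMS and base64 keys are demo values only. Production
# code would plug in a KMS client class for the company's real KMS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-encryption-demo").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

hconf.set("parquet.crypto.factory.class",
          "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hconf.set("parquet.encryption.kms.client.class",
          "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
hconf.set("parquet.encryption.key.list",
          "keyA:AAECAwQFBgcICQoLDA0ODw==, keyB:AAECAAECAAECAAECAAECAA==")

df = spark.createDataFrame(
    [(1, "alice", 120000), (2, "bob", 95000)],
    ["id", "name", "salary"])

# Encrypt the sensitive columns with keyA and the file footer with keyB.
(df.write
   .option("parquet.encryption.column.keys", "keyA:name,salary")
   .option("parquet.encryption.footer.key", "keyB")
   .parquet("/tmp/encrypted_orders"))

# Reads are transparent for users whose KMS grants access to the keys.
spark.read.parquet("/tmp/encrypted_orders").show()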

Talk by: Gidon Gershinsky

Here’s more to explore: Data, Analytics, and AI Governance: https://dbricks.co/44gu3YU

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

An API for Deep Learning Inferencing on Apache Spark™

Apache Spark is a popular distributed framework for big data processing. It is commonly used for ETL (extract, transform, and load) across large datasets. Today, the transform stage can often include the application of deep learning models on the data. For example, common models can be used for classification of images, sentiment analysis of text, language translation, anomaly detection, and many other use cases. Applying these models within Spark can be done today with the combination of PySpark, pandas_udf, and a lot of glue code. Often, that glue code can be difficult to get right, because it requires expertise across multiple domains: deep learning frameworks, PySpark APIs, pandas_udf internal behavior, and performance optimization.

In this session, we introduce a new, simplified API for deep learning inferencing on Spark, introduced in SPARK-40264 as a collaboration between NVIDIA and Databricks, which seeks to standardize and open source this glue code to make deep learning inference integrations easier for everyone. We discuss its design and demonstrate its usage across multiple deep learning frameworks and models.
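The API in question is predict_batch_udf, added in Spark 3.4 under SPARK-40264. Here is a minimal sketch of its usage; the arithmetic "model" below is a stand-in for a real TensorFlow or PyTorch model, which would be loaded inside make_predict_fn.

# Minimal sketch of predict_batch_udf (Spark 3.4+); the toy predict
# function stands in for a real deep learning model call.
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.functions import predict_batch_udf
from pyspark.sql.types import FloatType

spark = SparkSession.builder.appName("dl-inference-demo").getOrCreate()

def make_predict_fn():
    # Called once per executor: load the framework model here so it is
    # not re-initialized for every batch.
    def predict(inputs: np.ndarray) -> np.ndarray:
        return inputs * 2.0  # stand-in for model.predict(inputs)
    return predict

infer = predict_batch_udf(make_predict_fn,
                          return_type=FloatType(),
                          batch_size=64)

df = spark.range(100).selectExpr("cast(id as float) as feature")
df.withColumn("prediction", infer("feature")).show(5)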

Talk by: Lee Yang

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

How the Texas Rangers Revolutionized Baseball Analytics with a Modern Data Lakehouse

Don't miss this session, where we demonstrate how we at the Texas Rangers baseball team organized our predictive models using MLflow and the MLflow Model Registry inside Databricks. We started using Databricks as a simple solution for centralizing our development on the cloud. This helped lessen the issue of siloed development on our team, and allowed us to leverage the benefits of distributed cloud computing.

But we quickly found that Databricks was a perfect solution to another problem that we faced in our data engineering stack. Specifically, cost, complexity, and scalability issues hampered our data architecture development for years, and we decided we needed to modernize our stack by migrating to a lakehouse. With Databricks Lakehouse, ad-hoc-analytics, ETL operations, and MLOps all living within Databricks, development at scale has never been easier for our team.

Going forward, we hope to fully eliminate the silos of development, and remove the disconnect between our analytics and data engineering teams. From computer vision, pose analytics, and player tracking, to pitch design, base stealing likelihood, and more, come see how the Texas Rangers are using innovative cloud technologies to create action-driven reports from the current sea of big data.

Talk by: Alexander Booth and Oliver Dykstra

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Future Data Access Control: Booz Allen Hamilton’s Way of Securing Databricks Lakehouse with Immuta

In this talk, I'll review how we utilize Attribute-Based Access Control (ABAC) to enforce policy via Immuta. I'll discuss the differences between the ABAC and legacy Role-Based Access Control (RBAC) approaches to access control, and how the RBAC approach is not sufficient to keep up with today's growing big data market. With so much data available, there also comes substantial risk. Data can contain many sensitive elements, including PII and PHI. Industry leaders like Databricks are pushing the boundaries of data technology, which leads to constantly evolving data use cases. And that's a good thing. However, the RBAC approach is struggling to keep up with those advancements.

So what is RBAC? It's an approach to data access that permits system access based on the end user's role. For legacy systems, it's meant as a simple but effective approach to securing data. Are you a manager? Then you'll get access to data meant for managers. This is great for small deployments with clearly defined roles. Here at Booz Allen, however, we invested in Databricks because we have an environment of over 30 thousand users and billions of rows of data; at that scale, maintaining an ever-growing set of roles and user-to-role mappings quickly becomes unmanageable.

To mitigate this problem and align with our forward-thinking company standard, we introduced Immuta into our stack. Immuta uses ABAC to allow for dynamic data access control. Users are automatically assigned certain attributes, and access is based on those attributes instead of just their role. This allows for more flexibility and allows data access control to easily scale without the need to constantly map a user to their role. Using attributes, we can write policies in one place and have them applied across all our data platforms. This makes for a truly holistic data governance approach and provides immediate ROI and time savings for the company.
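As a miniature illustration of that difference (a generic Python sketch of the ABAC idea, not Immuta's actual policy engine), access here is derived from user and resource attributes rather than an enumerated role-to-data mapping:

# Toy ABAC sketch: the policy is written once against attributes and
# tags, so no per-role mapping is needed. Names and tags are invented.
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    attributes: set = field(default_factory=set)

@dataclass
class Column:
    name: str
    tags: set = field(default_factory=set)

def can_read(user, column):
    # Policy: columns tagged as PII require the 'pii_trained'
    # attribute; everything else is open to authenticated users.
    if "pii" in column.tags:
        return "pii_trained" in user.attributes
    return True

analyst = User("jordan", {"pii_trained"})
intern = User("sam", set())
email = Column("email", {"pii"})

assert can_read(analyst, email)
assert not can_read(intern, email)

Because the policy refers to attributes rather than roles, onboarding a new user or tagging a new column requires no new roles or mappings.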

Talk by: Jeffrey Hess

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

AI for Big Data-Based Engineering Applications from Security Perspectives

This book emphasizes understanding the motivation behind advanced circuit design in order to establish the AI interface and better mitigate security attacks on big data. It is intended for students, researchers, professionals, faculty members, and software developers who wish to carry out further research.

About 10 years ago, Thomas Davenport and DJ Patil published the article "Data Scientist: The Sexiest Job of the 21st Century" in the Harvard Business Review. In this piece, they described the burgeoning role of the data scientist and what it would mean for organizations and individuals in the coming decade. As time has passed, data science has become increasingly institutionalized. Once seen as a luxury, it is now deemed a necessity in every modern boardroom. Moreover, as technologies like AI and systems like ChatGPT keep astonishing us with their capabilities in handling data science tasks, a pertinent question arises: Is data science still the sexiest job of the 21st century?

In this episode, we invited Thomas Davenport on the show to share his perspective on where data science and AI are today, and where they are headed. Thomas Davenport is the President's Distinguished Professor of Information Technology and Management at Babson College, the co-founder of the International Institute for Analytics, a Fellow of the MIT Initiative on the Digital Economy, and a Senior Advisor to Deloitte Analytics. He has written or edited twenty books and over 250 print or digital articles for Harvard Business Review (HBR), Sloan Management Review, the Financial Times, and many other publications. One of HBR's most frequently published authors, Thomas has been at the forefront of the Process Innovation, Knowledge Management, and Analytics and Big Data movements. He pioneered the concept of "competing on analytics" with his 2006 Harvard Business Review article and his 2007 book of the same name. Since then, he has continued to provide cutting-edge insights on how companies can use analytics and big data to their advantage, and more recently on artificial intelligence.

Throughout the episode, we discuss how data science has changed since he first published his article, how it has become more institutionalized, how data leaders can drive value with data science, the importance of data culture, his views on AI and where he thinks it's going, and a lot more.

Links from the Show:
Working with AI by Thomas Davenport
The AI Advantage: How to Put the Artificial Intelligence Revolution to Work by Thomas Davenport
Harvard Business Review
New Vantage Partners
CCC Intelligent Solutions
Radar AI

The Road to Exceptional Data Correctness

ABOUT THE TALK: In this lightning talk, Emma Tang shares learnings from Stripe's early efforts to tackle data correctness. For a financial technology company, data correctness is paramount, and this low tolerance for data inaccuracy poses unique constraints on how infrastructure is designed. Emma shares strategies, as well as the trade-offs made, to achieve this high level of correctness.

ABOUT THE SPEAKER: Emma Tang led Big Data Infrastructure at Stripe, helping the company build and scale the data infrastructure systems that supported 14x revenue growth and 6x headcount growth during her time there.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.

Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data-related topics including data infrastructure, data engineering, ML systems, analytics and AI from top startups and tech companies.

FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai/

Data Contracts in the Modern Data Stack  | Whatnot

ABOUT THE TALK: After two years, three rounds of funding, and hundreds of new employees, Whatnot's modern data stack has gone from nonexistent to processing tens of millions of events across hundreds of different event types each day.

How does their small (but mighty!) team keep up? This talk explores data contracts: it covers the use of an Interface Definition Language (Protobuf) to serve as the source of truth for event definitions, govern event construction in production, and automatically generate dbt models in the data warehouse.
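As a toy sketch of that last step (not Whatnot's actual tooling), one can walk a message's fields and emit a dbt staging model. The field list is hard-coded here where real code would read a compiled protobuf descriptor, and the Snowflake-style payload:field access is just one possible target dialect.

# Toy sketch: emit a dbt staging model from protobuf-like field
# definitions. Real code would read a compiled FileDescriptorSet;
# the type mapping and JSON access syntax are illustrative.
PROTO_TO_SQL = {
    "string": "varchar",
    "int64": "bigint",
    "bool": "boolean",
    "double": "double",
}

def dbt_model_sql(event_name, fields):
    cols = ",\n    ".join(
        "cast(payload:%s as %s) as %s" % (name, PROTO_TO_SQL[ptype], name)
        for name, ptype in fields)
    return ("select\n    %s\nfrom {{ source('events', '%s') }}\n"
            % (cols, event_name))

print(dbt_model_sql("order_created",
                    [("order_id", "string"),
                     ("amount_cents", "int64"),
                     ("is_gift", "bool")]))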

ABOUT THE SPEAKER: Zack Klein is a software engineer at Whatnot, where he thoroughly enjoys building data products and narrowly avoiding breaking production each day. Previously, he worked on big data platforms at Blackstone and HBO.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.

Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data-related topics including data infrastructure, data engineering, ML systems, analytics and AI from top startups and tech companies.

FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai/

Big Data is Dead | MotherDuck

This talk will make the case that the era of Big Data is over. Now we can stop worrying about data size and focus on how we’re going to use it to make better decisions.

The data behind the graphs shown in this talk comes from Jordan Tigani's analysis of query logs, deal post-mortems, benchmark results (published and unpublished), customer support tickets, customer conversations, service logs, and published blog posts, plus a bit of intuition.

ABOUT THE SPEAKER: Jordan Tigani is co-founder and chief duck-herder at MotherDuck, a startup building a serverless analytics platform based on DuckDB. He helped create Google BigQuery, wrote two books on it, and led first the engineering team and then the product team through its first $1B or so in revenue.

👉 Sign up for our “No BS” Newsletter to get the latest technical data & AI content: https://datacouncil.ai/newsletter

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.

Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data-related topics including data infrastructure, data engineering, ML systems, analytics and AI from top startups and tech companies.

FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai/

Building a Control Plane for Data | Acryl

ABOUT THE TALK: This talk explains what the control plane of data looks like and how it fits into the reference architecture for the deconstructed data stack: a data stack that includes operational data stores, streaming systems, transformation engines, BI tools, warehouses, ML tools and orchestrators.

It dives into the fundamental characteristics of a control plane:

Breadth (completeness)
Latency (freshness)
Scale
Source of truth
Auditability

ABOUT THE SPEAKER: Shirshanka Das is the Co-founder and CEO of Acryl Data, the company which is commercializing the open source DataHub project, a real-time metadata platform used by LinkedIn, Stripe, Pinterest, Optum, Expedia and many others. Prior to founding Acryl, he was the overall architect for Big Data at LinkedIn from 2010 to 2020, and responsible for creating the metadata and data management strategy at the company.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.

Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data-related topics including data infrastructure, data engineering, ML systems, analytics and AI from top startups and tech companies.

FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai

IBM FlashSystem 7300 Product Guide

This IBM® Redpaper Product Guide describes the IBM FlashSystem® 7300 solution, which is a next-generation IBM FlashSystem control enclosure. It combines the performance of flash and a Non-Volatile Memory Express (NVMe)-optimized architecture with the reliability and innovation of IBM FlashCore® technology and the rich feature set and high availability (HA) of IBM Spectrum® Virtualize.

To take advantage of artificial intelligence (AI)-enhanced applications, real-time big data analytics, and cloud architectures that require higher levels of system performance and storage capacity, enterprises around the globe are rapidly moving to modernize established IT infrastructures. However, for many organizations, staff resources and expertise are limited, and cost-efficiency is a top priority. These organizations have important investments in existing infrastructure that they want to maximize. They need enterprise-grade solutions that optimize cost-efficiency while simplifying the pathway to modernization. IBM FlashSystem 7300 is designed specifically for these requirements and use cases. It also delivers cyber resilience without compromising application performance.

IBM FlashSystem 7300 provides a rich set of software-defined storage (SDS) features that are delivered by IBM Spectrum Virtualize, including the following examples:

Data reduction and deduplication
Dynamic tiering
Thin-provisioning
Snapshots
Cloning
Replication and data copy services
Cyber resilience
Transparent Cloud Tiering (TCT)
IBM HyperSwap®, including 3-site replication for high availability
Scale-out and scale-up configurations that further enhance capacity and throughput for better availability

With the release of IBM Spectrum Virtualize V8.5, extra functions and features are available, including support for new third-generation IBM FlashCore Module Non-Volatile Memory Express (NVMe) type drives within the control enclosure, and 100 Gbps Ethernet adapters that provide NVMe Remote Direct Memory Access (RDMA) options. New software features include GUI enhancements, security enhancements including multifactor authentication and single sign-on, and Fibre Channel (FC) portsets.