talk-data.com

Topic: Big Data
Tags: data_processing, analytics, large_datasets
1217 tagged activities

Activity Trend: peak of 28 activities per quarter, 2020-Q1 to 2026-Q1

Activities (1217 · newest first)

Sink Framework Evolution in Apache Flink

Apache Flink is one of the most popular frameworks for unified stream and batch processing. Like every other big data framework, Apache Flink offers connectors for reading from and writing to different external systems; we refer to connectors that write to external systems as sinks. Over the years, multiple frameworks for building sinks have existed inside Apache Flink. The Apache Flink community has also noticed the recent trend of ingesting real-time data directly into data lakes for further use. Therefore, with Apache Flink 1.15, we released the next iteration of our sink framework, designed to accommodate the needs of modern data lake connectors, such as lazy file compaction and user-defined shuffling.

In this talk, we first give a brief historical overview of how these frameworks evolved, from what was essentially a simple map operation to a custom operator model that simplifies two-phase-commit semantics. Second, we take a deep dive into Apache Flink’s fault tolerance model to explain how the latest iteration of the sink framework supports exactly-once processing and the complex operations that delta lakes require. In summary, this talk introduces the principles behind the sink framework in Apache Flink and gives developers a starting point for building a new connector for Apache Flink.
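To make the exactly-once idea concrete, here is a minimal conceptual sketch in Python of the two-phase-commit pattern a transactional sink follows: records are staged, pre-committed when a checkpoint is taken, and only made visible once the checkpoint completes. This is deliberately not Flink's actual Sink API; the class and method names below are invented for illustration.

```python
# Conceptual sketch of the two-phase-commit pattern used by transactional sinks.
# Not Flink's actual API; names here (TwoPhaseCommitSink, Transaction) are hypothetical.

import uuid


class Transaction:
    def __init__(self):
        self.id = str(uuid.uuid4())
        self.staged = []          # records written but not yet visible


class TwoPhaseCommitSink:
    def __init__(self):
        self.current = Transaction()
        self.pending = {}         # checkpoint_id -> pre-committed transaction
        self.committed = []       # records visible to downstream readers

    def write(self, record):
        # Phase 0: stage records into the currently open transaction.
        self.current.staged.append(record)

    def pre_commit(self, checkpoint_id):
        # Phase 1: on checkpoint, flush and park the transaction; open a new one.
        self.pending[checkpoint_id] = self.current
        self.current = Transaction()

    def commit(self, checkpoint_id):
        # Phase 2: once the checkpoint completes everywhere, make the data visible.
        txn = self.pending.pop(checkpoint_id)
        self.committed.extend(txn.staged)

    def abort(self, checkpoint_id):
        # On failure, discard staged data so replay does not produce duplicates.
        self.pending.pop(checkpoint_id, None)


sink = TwoPhaseCommitSink()
sink.write({"user": "a", "clicks": 3})
sink.pre_commit(checkpoint_id=1)
sink.commit(checkpoint_id=1)
print(sink.committed)
```

Flink's own sink framework realizes this flow through separate writer and committer components, but the movement of data from staged to pre-committed to committed is the same.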


Data-centric AI development: From Big Data to Good Data | Andrew Ng

Data-centric AI is a growing movement that shifts the engineering focus in AI systems from the model to the data. However, Data-centric AI faces many open challenges, including measuring data quality, data iteration, engineering data as part of the ML project workflow, data management tools, crowdsourcing, data augmentation and data synthesis, and responsible AI. This talk names the key pillars of Data-centric AI, identifies the trends in the Data-centric AI movement, and sets a vision for taking ideas applied intuitively by a handful of experts and synthesizing them into tools that make their application systematic for everyone.


Privacy Preserving Machine Learning and Big Data Analytics Using Apache Spark

In recent years, new privacy laws and regulations have brought a fundamental shift in the protection of data and privacy, posing new challenges to data applications. To resolve these privacy and security challenges in the big data ecosystem without impacting existing applications, several hardware TEE (Trusted Execution Environment) solutions have been proposed for Apache Spark, e.g., PySpark with Scone and Opaque. However, to the best of our knowledge, none of them provides full protection for data pipelines in Spark applications; an adversary may still obtain sensitive information from unprotected components and stages. Furthermore, some of them greatly narrow the range of supported applications, e.g., supporting only SparkSQL. In this presentation, we will present a new PPMLA (privacy preserving machine learning and analytics) solution built on top of Apache Spark, BigDL, Occlum, and Intel SGX. It ensures that all Spark components and pipelines are fully protected by Intel SGX, and that existing Spark applications written in Scala, Java, or Python can be migrated to our platform without any code change. We will demonstrate how to build distributed end-to-end SparkML/SparkSQL workloads with our solution in untrusted cloud environments and share real-world use cases for PPMLA.


Revolutionizing agriculture with AI: Delivering smart industrial solutions built upon a Lakehouse

John Deere is leveraging big data and AI to deliver ‘smart’ industrial solutions that are revolutionizing agriculture and construction, driving sustainability and ultimately helping to feed the world. The John Deere Data Factory that is built upon the Databricks Lakehouse Platform is at the core of this innovation. It ingests petabytes of data and trillions of records to give data teams fast, reliable access to standardized data sets supporting 100s of ML and analytics use cases across the organization. From IoT sensor-enabled equipment driving proactive alerts that prevent failures, to precision agriculture that maximizes field output, to optimizing operations in the supply chain, finance and marketing, John Deere is providing advanced products, technology and services for customers who cultivate, harvest, transform, enrich, and build upon the land.


ROAPI: Serve Not So Big Data Pipeline Outputs Online with Modern APIs

Data is the key component of any analytics, AI, or ML platform. Organizations are unlikely to succeed without a platform that can source, transform, quality-check, and present data in a reportable format that drives actionable insights.

This session focuses on how the Capital One HR team built a low-cost data movement ecosystem that can source data, transform it at scale, and populate data storage (Redshift) in a form that AI/ML programs can easily consume, using AWS services combined with open-source software (Spark) and the enterprise edition of Hydrograph (a UI-based ETL tool with Spark as the backend). The presentation mainly demonstrates the flexibility that Apache Spark provides for various types of ETL data pipelines.

We have been running three types of pipelines for over six years, with more than 400 nightly batch jobs at roughly $1,000/month: (1) Spark on EC2, (2) a UI-based ETL tool with a Spark backend (on the same EC2 instances), and (3) Spark on EMR. We have a CI/CD pipeline that supports easy integration and code deployment across all non-prod and prod regions (including automated unit testing). We will also demonstrate how this ecosystem can fail over to a different region in less than 15 minutes, making our application highly resilient.
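To give a feel for what one of these Spark batch jobs can look like, below is a minimal, hypothetical PySpark sketch: it reads raw files from S3, applies a simple transformation, and appends the result to a Redshift table over JDBC. The bucket, table, and connection details are placeholders, and the generic JDBC writer is only one of several ways to load Redshift.

```python
# Minimal PySpark ETL sketch: S3 -> transform -> Redshift via JDBC.
# Paths, table names, and credentials below are placeholders.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hr-nightly-etl").getOrCreate()

# Extract: read raw CSV files landed in S3.
raw = spark.read.option("header", True).csv("s3://example-bucket/hr/raw/employees/")

# Transform: basic cleanup and an aggregate per department.
dept_summary = (
    raw.withColumn("salary", F.col("salary").cast("double"))
       .filter(F.col("status") == "active")
       .groupBy("department")
       .agg(F.count("*").alias("headcount"),
            F.avg("salary").alias("avg_salary"))
)

# Load: append the result into a Redshift table over JDBC.
(dept_summary.write
    .format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/hr")
    .option("dbtable", "analytics.dept_summary")
    .option("user", "etl_user")
    .option("password", "***")
    .mode("append")
    .save())
```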


Day 2 Morning Keynote | Data + AI Summit 2022

Day 2 Morning Keynote | Data + AI Summit 2022
Production Machine Learning | Patrick Wendell
MLflow 2.0 | Kasey Uhlenhuth
Revolutionizing agriculture with AI: Delivering smart industrial solutions built upon a Lakehouse architecture | Ganesh Jayaram
Intuit’s Data Journey to the Lakehouse: Developing Smart, Personalized Financial Products for 100M+ Consumers & Small Businesses | Alon Amit and Manish Amde
Workflows | Stacy Kerkela
Delta Live Tables | Michael Armbrust
AI and creativity, and building data products where there's no quantitative metric for success, such as in games, or web-scale search, or content discovery | Hilary Mason
What to Know about Data Science and Machine Learning in 2022 | Peter Norvig
Data-centric AI development: From Big Data to Good Data | Andrew Ng


Airflow has become nearly the de facto standard job orchestration tool for the production stage. But moving a job from the development stage in other tools to the production stage in Airflow is usually a big pain for many users, largely because of inconsistencies between the development and production environments. Apache Zeppelin is a web-based notebook that integrates seamlessly with many popular big data engines, such as Spark, Flink, Hive, and Presto, which makes it very suitable for the development stage. In this talk, I will cover the seamless integration between Airflow and Zeppelin, so that you can develop your big data job efficiently in Zeppelin and move it to Airflow easily, without worrying too much about issues caused by environment inconsistency.
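One way to wire the two together, sketched below, is an Airflow DAG that triggers a Zeppelin note through Zeppelin's REST API (POST /api/notebook/job/<noteId> runs all paragraphs of a note). The Zeppelin host, note ID, and schedule are placeholders, and the talk may describe a different integration mechanism.

```python
# Sketch: an Airflow DAG that runs a Zeppelin note via Zeppelin's REST API.
# The Zeppelin URL and note ID below are placeholders.

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

ZEPPELIN_URL = "http://zeppelin.example.com:8080"
NOTE_ID = "2ABCDEFGH"  # placeholder note ID


def run_zeppelin_note(**_):
    # "Run all paragraphs" endpoint of the Zeppelin notebook REST API.
    resp = requests.post(f"{ZEPPELIN_URL}/api/notebook/job/{NOTE_ID}", timeout=60)
    resp.raise_for_status()


with DAG(
    dag_id="zeppelin_big_data_job",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_note",
        python_callable=run_zeppelin_note,
    )
```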

Data + AI Summit 2022 Keynote from John Deere: Revolutionizing agriculture with AI

Hear Ganesh Jayaram, CIO of John Deere, talk about how the company is leveraging big data and AI to deliver ‘smart’ industrial solutions that are revolutionizing agriculture, driving sustainability, and ultimately helping to feed the world. The John Deere Data Factory, built upon the Databricks Lakehouse Platform, is at the core of this innovation. It ingests 8 petabytes of data and trillions of records to give data teams fast, reliable access to standardized data sets powering over 3000 ML and analytics use cases, democratizing data across John Deere and fostering a culture of empowerment where data is everybody's responsibility.

Visit the Data + AI Summit at https://databricks.com/dataaisummit/

Big Data Analytics and Machine Intelligence in Biomedical and Health Informatics

BIG DATA ANALYTICS AND MACHINE INTELLIGENCE IN BIOMEDICAL AND HEALTH INFORMATICS provides coverage of developments and state-of-the-art methods in the broad and diversified data analytics field and applicable areas such as big data analytics, data mining, and machine intelligence in biomedical and health informatics. The novel application of big data analytics and machine intelligence in the biomedical and healthcare sector is an emerging field comprising computer science, medicine, biology, natural environmental engineering, and pattern recognition. Biomedical and health informatics is a new era that brings tremendous opportunities and challenges due to the abundance of available biomedical data, and the aim is to ensure high-quality, efficient healthcare by analyzing that data. The 12 chapters in Big Data Analytics and Machine Intelligence in Biomedical and Health Informatics cover the latest advances and developments in health informatics, data mining, machine learning, and artificial intelligence. They have been organized by similarity of topic, ranging from issues pertaining to the Internet of Things (IoT) for biomedical engineering and health informatics, to computational intelligence for medical data processing, to the Internet of Medical Things (IoMT). New researchers and practitioners working in the field will benefit from reading the book, as they can quickly ascertain the best performing methods and compare the different approaches. Audience: Researchers and practitioners working in the fields of biomedicine, health informatics, big data analytics, Internet of Things, and machine learning.

Analytics for Retail: A Step-by-Step Guide to the Statistics Behind a Successful Retail Business

Examine select retail business scenarios to learn the basic mathematics, probability, and statistics required to analyze big data. This book focuses on the useful, imperative applied analytics needed to build a retail business and explains mathematical concepts essential for decision making and communication in retail business environments. Everyone is a buyer or seller of products these days, whether through a physical department store, Amazon, or their own business website. This book is a step-by-step guide to understanding and managing the mechanics of markups, markdowns, and the basic statistics, math, and computing that will help in your retail business. You'll tackle what to do with data once it has accumulated, see how to arrange the data using descriptive statistics (primarily mean, median, and mode), and then learn to read the corresponding charts and graphs. Analytics for Retail is your path to creating visual representations that powerfully communicate information and drive decisions.

What You'll Learn
Review standard statistical concepts to enhance your understanding of retail data
Understand the concepts of markups, markdowns, profit margins, and probability
Conduct an A/B testing email campaign with all the relevant analytics calculated and explained

Who This Book Is For
This is a primer for anyone in the field of retail who needs to learn or refresh their skills, or for a reader who wants to move into a more analytical position in their company.
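To make the arithmetic concrete, here is a tiny Python sketch of the kinds of calculations the book covers: markup and markdown percentages (using the cost-based markup convention, one of the common ones) and mean, median, and mode over a toy list of daily sales. All numbers are made up.

```python
# Toy retail math: markup/markdown percentages and descriptive statistics.
# Uses the cost-based markup convention; all numbers are made up.

from statistics import mean, median, mode

cost, selling_price, sale_price = 40.00, 60.00, 45.00

markup_pct = (selling_price - cost) / cost * 100                    # markup on cost
markdown_pct = (selling_price - sale_price) / selling_price * 100   # markdown off retail

daily_units_sold = [12, 15, 15, 9, 20, 15, 11]

print(f"Markup:   {markup_pct:.1f}%")
print(f"Markdown: {markdown_pct:.1f}%")
print(f"Mean {mean(daily_units_sold):.1f}, "
      f"median {median(daily_units_sold)}, mode {mode(daily_units_sold)}")
```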

Beginning Data Science in R 4: Data Analysis, Visualization, and Modelling for the Data Scientist

Discover best practices for data analysis and software development in R and start on the path to becoming a fully fledged data scientist. Updated for the R 4.0 release, this book teaches you techniques for both data manipulation and visualization and shows you the best way to develop new software packages for R. Beginning Data Science in R 4, Second Edition details how data science is a combination of statistics, computational science, and machine learning. You’ll see how to efficiently structure and mine data to extract useful patterns and build mathematical models. This requires computational methods and programming, and R is an ideal programming language for this. Modern data analysis requires computational skills and usually a minimum of programming. After reading and using this book, you'll have what you need to get started with R programming with data science applications. Source code is available at github.com/Apress/beg-data-science-r4 to support your next projects as well.

What You Will Learn
Perform data science and analytics using statistics and the R programming language
Visualize and explore data, including working with large data sets found in big data
Build an R package
Test and check your code
Practice version control
Profile and optimize your code

Who This Book Is For
Those with some data science or analytics background, but not necessarily experience with the R programming language.

Advanced Analytics with PySpark

The amount of data being generated today is staggering and growing. Apache Spark has emerged as the de facto tool to analyze big data and is now a critical part of the data science toolbox. Updated for Spark 3.0, this practical guide brings together Spark, statistical methods, and real-world datasets to teach you how to approach analytics problems using PySpark, Spark's Python API, and other best practices in Spark programming. Data scientists Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills offer an introduction to the Spark ecosystem, then dive into patterns that apply common techniques (including classification, clustering, collaborative filtering, and anomaly detection) to fields such as genomics, security, and finance. This updated edition also covers NLP and image processing. If you have a basic understanding of machine learning and statistics and you program in Python, this book will get you started with large-scale data analysis.

Familiarize yourself with Spark's programming model and ecosystem
Learn general approaches in data science
Examine complete implementations that analyze large public datasets
Discover which machine learning tools make sense for particular problems
Explore code that can be adapted to many uses
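As a flavor of the patterns covered, here is a minimal PySpark sketch that assembles feature columns and fits a k-means clustering model with Spark MLlib's DataFrame API; the inline toy data is invented and not an example from the book itself.

```python
# Minimal PySpark MLlib sketch: feature assembly + k-means clustering.
# The toy data below is made up and not taken from the book.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 1.1), (0.9, 1.0), (8.0, 8.2), (7.9, 8.1)],
    ["x", "y"],
)

# Pack raw columns into the single vector column MLlib estimators expect.
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
features = assembler.transform(df)

model = KMeans(k=2, seed=42, featuresCol="features").fit(features)
model.transform(features).select("x", "y", "prediction").show()
```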

Today’s data architecture discussions are heavily biased toward managing data for analytics, with attention to big data, scalability, cloud, and cross-platform data management. We need to acknowledge analytics bias and address management of operational data. Ignoring operational data architecture is a sure path to technical debt and future data management pain. Published at: https://www.eckerson.com/articles/the-yin-and-yang-of-data-architecture

Elasticsearch 8.x Cookbook - Fifth Edition

"Elasticsearch 8.x Cookbook" is your go-to resource for harnessing the full potential of Elasticsearch 8. This book provides over 180 hands-on recipes to help you efficiently implement, customize, and scale Elasticsearch solutions in your enterprise. Whether you're handling complex queries, analytics, or cluster management, you'll find practical insights to enhance your capabilities. What this Book will help me do Understand the advanced features of Elasticsearch 8.x, including X-Pack, for improving functionality and security. Master advanced indexing and query techniques to perform efficient and scalable data operations. Implement and manage Elasticsearch clusters effectively including monitoring performance via Kibana. Integrate Elasticsearch seamlessly into Java, Scala, Python, and big data environments. Develop custom plugins and extend Elasticsearch to meet unique project requirements. Author(s) Alberto Paro is a seasoned Elasticsearch expert with years of experience in search technologies and enterprise solution development. As a professional developer and consultant, he has worked with numerous organizations to implement Elasticsearch at scale. Alberto brings his deep technical knowledge and hands-on approach to this book, ensuring readers gain practical insights and skills. Who is it for? This book is perfect for software engineers, data professionals, and developers working with Elasticsearch in enterprise environments. If you're seeking to advance your Elasticsearch knowledge, enhance your query-writing abilities, or seek to integrate it into big data workflows, this book will be invaluable. Regardless of whether you're deploying Elasticsearch in e-commerce, applications, or for analytics, you'll find the content purposeful and engaging.

Bioinformatics and Medical Applications

BIOINFORMATICS AND MEDICAL APPLICATIONS: The main topics addressed in this book are big data analytics problems in bioinformatics research, such as microarray data analysis, sequence analysis, genomics-based analytics, disease network analysis, techniques for big data analytics, and health information technology. Bioinformatics and Medical Applications: Big Data Using Deep Learning Algorithms analyzes massive biological datasets using computational approaches and the latest cutting-edge technologies to capture and interpret biological data. The book delivers various bioinformatics computational methods used to identify diseases at an early stage, assembling cutting-edge resources into a single collection designed to enlighten the reader on topics spanning computer science, mathematics, and biology. In modern biology and medicine, bioinformatics is critical for data management. This book explains the bioinformatician’s important tools and examines how they are used to evaluate biological data and advance disease knowledge. The editors have curated a distinguished group of perceptive and concise chapters that present the current state of medical treatments and systems and offer emerging solutions for a more personalized approach to healthcare. Applying deep learning techniques for data-driven solutions in health information allows automated analysis that can better support the problems arising from medical and health-related information. Audience: The primary audience for the book includes specialists, researchers, postgraduates, designers, experts, and engineers who are occupied with biometric research and security-related issues.

In today's episode, Paulo Vasconcellos, Allan Sene, and Gabriel Lages discuss a controversial topic in the data industry: how the term Big Data is going to die. The conversation grew out of an interview with Andrew Ng, one of the world's AI pioneers, about making way for smarter, more context-aware data collection.

Download the State of Data Brazil dataset on Kaggle (https://www.kaggle.com/datasets/datahackers/state-of-data-2021)

See the Medium post for the episode's references (https://www.kaggle.com/datasets/datahackers/state-of-data-2021)

Simplify Big Data Analytics with Amazon EMR

Simplify Big Data Analytics with Amazon EMR is a thorough guide to harnessing Amazon's EMR service for big data processing and analytics. From distributed computation pipelines to real-time streaming analytics, this book provides hands-on knowledge and actionable steps for implementing data solutions efficiently.

What this book will help me do
Understand the architecture and key components of Amazon EMR and how to deploy it effectively
Learn to configure and manage distributed data processing pipelines using Amazon EMR
Implement security and data governance best practices within the Amazon EMR ecosystem
Master batch ETL and real-time analytics techniques using technologies like Apache Spark
Apply optimization and cost-saving strategies to scalable data solutions

Author(s)
Sakti Mishra is a seasoned data professional with extensive expertise in deploying scalable analytics solutions on cloud platforms like AWS. With a background in big data technologies and a passion for teaching, Sakti ensures practical insights accompany every concept. Readers will find his approach thorough, hands-on, and highly informative.

Who is it for?
This book is perfect for data engineers, data scientists, and other professionals looking to leverage Amazon EMR for scalable analytics. If you are familiar with Python, Scala, or Java and have some exposure to Hadoop or AWS ecosystems, this book will empower you to design and implement robust data pipelines efficiently.
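As an illustration of the kind of workflow the book automates, the sketch below uses boto3 to submit a spark-submit step to an existing EMR cluster through command-runner.jar; the cluster ID, script location, and region are placeholders.

```python
# Sketch: submit a spark-submit step to an existing EMR cluster with boto3.
# Cluster ID, script location, and region below are placeholders.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[
        {
            "Name": "nightly-batch-etl",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://example-bucket/jobs/etl_job.py",
                ],
            },
        }
    ],
)
print(response["StepIds"])
```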

For Danielle Crop, the Chief Data Officer of Albertsons, drawing distinctions between “digital” and “data” only limits an organization’s ability to create useful products. One of the reasons I asked Danielle on the show is her background as a CDO and former SVP of digital at AMEX, where she also managed product and design groups. My theory is that data leaders who have been exposed to the worlds of software product and UX design tend to approach their data product work differently, and that’s what we dug into in this episode. It didn’t take long for Danielle to share how she pushes her data science team to collaborate with business product managers for a “cross-functional, collaborative” end result. This also means getting the team to understand what their models are personalizing, and how customers experience the data products they use. In short, for her, it is about getting the data team to focus on “outcomes” vs. “outputs.”

Scaling some of the data science and ML modeling work at Albertsons is a big challenge, and we talked about one of the big use cases she is trying to enable for customers, as well as one “real-life” non-digital experience that her team’s data science efforts are behind.

The big takeaway for me here was hearing how a CDO like Danielle is really putting customer experience and the company’s brand at the center of their data product work, as opposed to solely focusing on ML model development, dashboard/BI creation, and seeing data as a raw ingredient that lives in a vacuum isolated from people.

In this episode, we cover:

Danielle’s take on the “D” in CDO: is the distinction between “digital” and “data” even relevant, especially for a food and drug retailer? (01:25)
The role of data product management and design in her org and how UX (i.e., shopper experience) is influenced by and considered in her team’s data science work (06:05)
How Danielle’s team thinks about “customers,” particularly in the context of internal stakeholders vs. grocery shoppers (10:20)
Danielle’s current and future plans for bringing her data team into stores to better understand shoppers and customers (11:11)
How Danielle’s data team works with the digital shopper experience team (12:02)
“Outputs” versus “outcomes” for product managers, data science teams, and data products (16:30)
Building customer loyalty, in-store personalization, and long-term brand interaction with data science at Albertsons (20:40)
How Danielle and her team at Albertsons measure the success of their data products (24:04)
Finding the problems, building the solutions, and connecting the data to the non-technical side of the company (29:11)

Quotes from Today’s Episode

“Data always comes from somewhere, right? It always has a source. And in our modern world, most of that source is some sort of digital software. So, to distinguish your data from its source is not very smart as a data scientist. You need to understand your data very well, where it came from, how it was developed, and software is a massive source of data. [As a CDO], I think it’s not important to distinguish between [data and digital]. It is important to distinguish between roles and responsibilities, you need different skills for these different areas, but to create an artificial silo between them doesn’t make a whole lot of sense to me.” - Danielle (03:00)

“Product managers need to understand what the customer wants, what the business needs, and how to pass that along to data scientists; and data scientists need to understand how that’s affecting business outcomes. That’s how I see this all working. And it depends on what type of models they’re customizing and building, right? Are they building personalization models that are going to be a digital asset? Are they building automation models that will go directly to some sort of operational activity in the store? What are they trying to solve?” - Danielle (06:30)

“In a company that sells products—groceries—to individuals, personalization is a huge opportunity. How do we make that experience, both in-digital and in-store, more relevant to the customer, more sticky and build loyalty with those customers? That’s the core problem, but underneath that is you got to build a lot of models that help personalize that experience. When you start talking about building a lot of different models, you need scale.”  - Danielle (9:24)

“[Customer interaction in the store] is a true big data problem, right, because you need to use the WiFi devices, et cetera, that you have in store that are pinging the devices at all times, and it’s a massive amount of data. Trying to weed through that and find the important signals that help us to actually drive that type of personalized experience is challenging. No one’s gotten there yet. I hope that we’ll be the first.” - Danielle (19:50)

“I can imagine a checkout clerk who doesn’t want to talk to the customer, despite a data-driven suggestion appearing on the clerk’s monitor as to how to personalize a given customer interaction. The recommendation suggested to the clerk may be ‘accurate’ from a data science point of view, but if the clerk doesn’t actually act on it, then the data product didn’t provide any value. When I train people in my seminar, I try to get them thinking about that last mile. It may not be data science work, and maybe you have a big enough org where that clerk/customer experience is someone else’s responsibility, but being aware that this is a fault point and having a cross-team perspective is key.” - Brian @rhythmspice (24:50)

“We’re going through a moment in time in which trust in data is shaky. What I’d like people to understand and know on a broader philosophical level, is that in order to be able to understand data and use it to make decisions, you have to know its source. You have to understand its source. You have to understand the incentives around that source of data….you have to look at the data from the perspective of what it means and what the incentives were for creating it, and then analyze it, and then give an output. And fortunately, most statisticians, most data scientists, most people in most fields that I know, are incredibly motivated to be ethical and accurate in the information that they’re putting out.” - Danielle (34:15)