Git

How Panasonic Leverages Airflow

2024-07-01 · Airflow Summit 2024

session

by Michael Atondo

Airflow Analytics CI/CD DWH GitLab MySQL Plotly Redis Redshift Tableau

Using various operators to perform daily routines. Integration with Technologies: Redis: Acts as a caching mechanism to optimize data retrieval and processing speed, enhancing overall pipeline performance. MySQL: Utilized for storing metadata and managing task state information within Airflow’s backend database. Tableau: Integrates with Airflow to generate interactive visualizations and dashboards, providing valuable insights into the processed data. Amazon Redshift: Panasonic leverages Redshift for scalable data warehousing, seamlessly integrating it with Airflow for data loading and analytics. Foundry: Integrated with Airflow to access and process data stored within Foundry’s data platform, ensuring data consistency and reliability. Plotly Dashboards: Employed for creating custom, interactive web-based dashboards to visualize and analyze data processed through Airflow pipelines. GitLab CI/CD Pipelines: Utilized for version control and continuous integration/continuous deployment (CI/CD) of Airflow DAGs (Directed Acyclic Graphs), ensuring efficient development and deployment of workflows.

#51 Is Data Science a Lonely Profession?

2024-05-22 · DataTopics: All Things Data, AI & Tech Listen

podcast_episode

AI/ML Data Engineering Data Science GDPR/CCPA LLM SQL

Send us a text Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society.

Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style!

In this episode: Slack's Data Practices: Discussing Slack's use of customer data to build models, the risks of global data leakage, and the impact of GDPR and AI regulations.ChatGPT's Data Analysis Improvements: Discussing new features in ChatGPT that let you interrogate your data like a pro. The Loneliness of Data Scientists: Why being a lone data wolf is tough, and how collaboration is the key to success. Rustworkx for Graph Computation: Evaluating Rustworkx as a robust tool for graphs compared to Networkx.Dolt - Git for Data: Comparing Dolt and DVC as tools for data version control. Check it out.Veo by Google DeepMind: An overview of Google's Veo technology and its potential applications.Ilya Sutskever’s Departure from OpenAI: What does Ilya Sutskever’s exit mean for OpenAI with Jakub Pachocki stepping in?Hot Takes - No Data Engineering Roadmap? Debating the necessity of a data engineering roadmap and the prominence of SQL skills.

Building Telemetry Curations and Effective Data Pipelines

2024-04-11 · Data Universe 2024

Face To Face

by Isaac Obezo

Analytics CI/CD Data Engineering Data Lake Data Quality dbt

Have you ever wondered how a data company does data? In this session, Isaac Obezo, Staff Data Engineer at Starburst, will take you for a peek behind the curtain into Starburst’s own data architecture built to support batch processing of telemetry data within Galaxy data pipelines. Isaac will walk you through our architecture utilizing tools like git, dbt, and Starburst Galaxy to create a CI/CD process allowing our data engineering team to iterate quickly to deploy new models, develop and land data, and create and improve existing models in the data lake. Isaac will also discuss Starburst’s mentality toward data quality, the use of data products, and the process toward delivering quality analytics.

From Colab to Artifact Registry: Building python packages for production

2024-04-10 · Google Cloud Next '24

session

by Jason Davenport (Google Cloud)

Cloud Computing GCP Python

This session shows how someone can take a simple notebook they've created on Colab, and create an executable Python wheel that can be checked in for source control and to Artifact Registry for production use.

Click the blue “Learn more” button above to tap into special offers designed to help you implement what you are learning at Google Cloud Next 25.

Data As Code Using lakeFS Open Source

2024-04-10 · Data Universe 2024

Face To Face

by Iddo Avneri (Treeverse)

AI/ML ETL/ELT

Operating data lakes over object storage poses challenges: testing ETL changes, staging pipelines, ensuring best practices, debugging, and tracking data usage for ML reproducibility. Enter lakeFS—an open-source data version control tool transforming object storage into Git-like repositories. Learn how lakeFS enables unified workflows for code and data, providing benefits like faster development and error recovery. Join us to explore lakeFS and harness the power of data as code for your team's success.

Open Source Nessie: Enabling DataOps, Catalog Versioning and Git for Data

2024-04-10 · Data Universe 2024

Face To Face

by alex merced (Dremio)

Data Lakehouse DataOps

Project Nessie is an open-source project that provides a Git-like approach to version control for data lakehouse tables. This makes it possible to track data changes over time and revert to previous versions if necessary.

In a lakehouse environment, catalog versioning is essential for ensuring the accuracy and reliability of data. By tracking changes to the catalog, you can ensure that everyone is working with the same data version. This can help to prevent errors and inconsistencies.

Project Nessie can be used to implement catalog versioning in a lakehouse environment. This can be done by creating a Nessie repository for the catalog and then tracking changes to the repository using Git.

This presentation will discuss the benefits of using Project Nessie for catalog versioning in a lakehouse environment. We will also discuss how to implement catalog versioning using Project Nessie.

YAML all the way down: State machines and CI/CD for Google Apigee

2024-04-09 · Google Cloud Next '24

session

by Hugh Greenish (Marsh McLennan)

API CI/CD Cloud Computing GCP Terraform YAML

Marsh McLennan runs a complex Apigee Hybrid configuration, with 36 organizations operating in six global data centers. Keeping all of this in sync across production and nonproduction environments is a challenge. While the infrastructure itself is deployed with Terraform, Marsh McLennan wanted to apply the same declarative approach to the entire environment. See how it used Apigee's management APIs to build a state machine to keep the whole system running smoothly, allowing APIs to flow seamlessly from source control through to production.

Click the blue “Learn more” button above to tap into special offers designed to help you implement what you are learning at Google Cloud Next 25.

Fundamentals of Analytics Engineering

2024-03-29 · O'Reilly Data Science Books O'Reilly Amazon

book

by Lasse Benninga (LaBenni Consulting) , Juan Manuel Perafan (Xebia) , Fanny Kassapian , Ricardo Angel Granados Lopez , Dumky de Wilde (MotherDuck) , Taís Laurindo Pereira , Jovan Gligorevic

Airbyte Analytics Analytics Engineering BigQuery CI/CD Data Engineering Data Modelling Data Quality dbt business-intelligence data data-science +1 more

Master the art and science of analytics engineering with 'Fundamentals of Analytics Engineering.' This book takes you on a comprehensive journey from understanding foundational concepts to implementing end-to-end analytics solutions. You'll gain not just theoretical knowledge but practical expertise in building scalable, robust data platforms to meet organizational needs. What this Book will help me do Design and implement effective data pipelines leveraging modern tools like Airbyte, BigQuery, and dbt. Adopt best practices for data modeling and schema design to enhance system performance and develop clearer data structures. Learn advanced techniques for ensuring data quality, governance, and observability in your data solutions. Master collaborative coding practices, including version control with Git and strategies for maintaining well-documented codebases. Automate and manage data workflows efficiently using CI/CD pipelines and workflow orchestrators. Author(s) Dumky De Wilde, alongside six co-authors-experienced professionals from various facets of the analytics field-delivers a cohesive exploration of analytics engineering. The authors blend their expertise in software development, data analysis, and engineering to offer actionable advice and insights. Their approachable ethos makes complex concepts understandable, promoting educational learning. Who is it for? This book is a perfect fit for data analysts and engineers curious about transitioning into analytics engineering. Aspiring professionals as well as seasoned analytics engineers looking to deepen their understanding of modern practices will find guidance. It's tailored for individuals aiming to boost their career trajectory in data engineering roles, addressing fundamental to advanced topics.

Version Your Data Lakehouse Like Your Software With Nessie

2024-03-10 · Data Engineering Podcast Listen

podcast_episode

by alex merced (Dremio) , Tobias Macey

AI/ML Analytics Cloud Computing Dagster Data Engineering Data Lake Data Lakehouse Data Management Data Science Delta Dremio Hudi +4 more

Summary

Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today! Your host is Tobias Macey and today I'm interviewing Alex Merced, developer advocate at Dremio and co-author of the upcoming book from O'reilly, "Apache Iceberg, The definitive Guide", about Nessie, a git-like versioned catalog for data lakes using Apache Iceberg

Interview

Introduction How did you get involved in the area of data management? Can you describe what Nessie is and the story behind it? What are the core problems/complexities that Nessie is designed to solve? The closest analogue to Nessie that I've seen in the ecosystem is LakeFS. What are the features that would lead someone to choose one or the other for a given use case? Why would someone choose Nessie over native table-level branching in the Apache Iceberg spec? How do the versioning capabilities compare to/augment the data versioning in Iceberg? What are some of the sources of, and challenges in resolving, merge conflicts between table branches? Can you describe the architecture of Nessie? How have the design and goals of the project changed since it was first created? What is involved

Cracking the Data Science Interview

2024-02-29 · O'Reilly Data Science Books O'Reilly Amazon

book

by Aaren Stubberfield , Leondra R. Gonzalez

AI/ML Bash Data Science Python SQL data data-science

"Cracking the Data Science Interview" is your ultimate resource for preparing for roles in the competitive field of data science. With this book, you'll explore essential topics such as Python, SQL, statistics, and machine learning, as well as learn practical skills for building portfolios and acing interviews. Follow its guidance and you'll be equipped to stand out in any data science interview. What this Book will help me do Confidently explain complex statistical and machine learning concepts. Develop models and deploy them while ensuring version control and efficiency. Learn and apply scripting skills in shell and Bash for productivity. Master Git workflows to handle collaborative coding in projects. Perfectly tailor portfolios and resumes to land data science opportunities. Author(s) Leondra R. Gonzalez, with years of data science and mentorship experience, co-authors this book with None Stubberfield, a seasoned expert in technology and machine learning. Together, they integrate their expertise to provide practical advice for navigating the data science job market. Who is it for? If you're preparing for data science interviews, this book is for you. It's ideal for candidates with a foundational knowledge of Python, SQL, and statistics looking to refine and expand their technical and professional skills. Professionals transitioning into data science will also find it invaluable for building confidence and succeeding in this rewarding field.

Git reflog : ma commande (sous-côtée) préférée de Git

2023-12-12 · Meetup HumanTalks Paris @leboncoin

talk

AI/ML

L'une des commandes git que j'utilise le plus en tant que développeuse (après les classiques git commit, git push, git pull et git rebase) est la commande git reflog. J'ai découvert cette commande alors que j'avais probablement encore raté un rebase et perdu des modifications, ou bien malencontreusement utilisé git reset --hard après avoir remonté trop vite mon historique de commande... Dans ce talk je vous présente cette commande, trop peu connue à mon goût, ainsi que de nombreux use cases où elle m'a sauvée la mise !

On the benefits and virtues of drilling pilot holes - Coalesce 2023

2023-10-24 · dbt Coalesce 2023 Watch

video

by Leo Folsom (Datafold)

CI/CD Cloud Computing Datafold dbt

A significant proportion of dbt Cloud users do not have a dbt CI job set up. Among those who do, many don’t leverage powerful functionality like state comparison and deferral to implement Slim CI, likely causing teams to miss errors and building unnecessary tables. Setting up Slim CI in dbt Cloud can be especially challenging for larger-scale data organizations who have multiple data environments, git branches, and targets. Watch this session to learn how you can build and evolve a strong, lasting data environment using Slim CI.

Speakers: Leo Folsom, Solutions Engineer, Datafold

Register for Coalesce at https://coalesce.getdbt.com

Powering Vector Search With Real Time And Incremental Vector Indexes

2023-09-25 · Data Engineering Podcast Listen

podcast_episode

by Louis Brandy , Tobias Macey

AI/ML Analytics BI CI/CD Cloud Computing CSV Data Engineering Data Management Data Quality Datafold dbt LLM +7 more

Summary

The rapid growth of machine learning, especially large language models, have led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! If you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan! Your host is Tobias Macey and today I'm interviewing Louis Brandy about building vector indexes in real-time for analytics and AI applications

Interview

Introduction How did you get involved in the area of data management? Can you describe what vector search is and how it differs from other search technologies?

What are the technical challenges related to providing vector search? What are the applications for vector search that merit the added complexity?

Vector databases have been gaining a lot of attention recently with the proliferation of LLM applicati

Building Linked Data Products With JSON-LD

2023-09-17 · Data Engineering Podcast Listen

podcast_episode

by Brian Platz , Tobias Macey

AI/ML Analytics BI CI/CD Cloud Computing CSV Data Engineering Data Management Data Quality Datafold dbt JSON +6 more

Summary

A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for building semantic data products.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! If you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan! Your host is Tobias Macey and today I'm interviewing Brian Platz about using JSON-LD for building linked-data products

Interview

Introduction How did you get involved in the area of data management? Can you describe what the term "linked data product" means and some examples of when you might build one?

What is the overlap between knowledge graphs and "linked data products"?

What is JSON-LD?

What are the domains in which it is typically used? How does it assist in developing linked data products?

what are the characterist

The Future is Open: Data Streaming in an Omni-Cloud Reality

2023-07-27 · Databricks DATA + AI Summit 2023 Watch

video

by Christina Taylor

Analytics Cloud Computing Data Engineering Data Lake Data Lakehouse Databricks DWH FinOps Omni Spark Data Streaming Terraform

This session begins with data warehouse trivia and lessons learned from production implementations of multicloud data architecture. You will learn to design future-proof low latency data systems that focus on openness and interoperability. You will also gain a gentle introduction to Cloud FinOps principles that can help your organization reduce compute spend and increase efficiency.

Most enterprises today are multicloud. While an assortment of low-code connectors boasts the ability to make data available for analytics in real time, they post long-lasting challenges:

Inefficient EDW targets
Inability to evolve schema
Forbiddingly expensive data exports due to cloud and vendor lock-in

The alternative is an open data lake that unifies batch and streaming workloads. Bronze landing zones in open format eliminate the data extraction costs required by proprietary EDW. Apache Spark™ Structured Streaming provides a unified ingestion interface. Streaming triggers allow us to switch back and forth between batch and stream with one-line code changes. Streaming aggregation enables us to incrementally compute on data that arrives near each other.

Specific examples are given on how to use Autoloader to discover newly arrived data and ensure exactly once, incremental processing. How DLT can be configured effectively to further simplify streaming jobs and accelerate the development cycle. How to apply SWE best practices to Workflows and integrate with popular Git providers, either using the Databricks Project or Databricks Terraform provider.

Talk by: Christina Taylor

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Building an ML Experimentation Platform for Easy Reproducibility | Treeverse

2023-05-11 · Data Council 2023 Watch

video

by Vino Duraisamy (lakeFS)

AI/ML Analytics Data Engineering Data Management MLOps NLP

ABOUT THE TALK: Quality ML at scale is only possible when we can reproduce a specific iteration of the ML experiment–and this is where data is key.

In this talk, you will learn how to use a data versioning engine to intuitively and easily version your ML experiments and reproduce any specific iteration of the experiment.

This talk will demo through a live code example: -Creating a basic ML experimentation framework with lakeFS (on Jupyter notebook) -Reproducing ML components from a specific iteration of an experiment Building intuitive, zero-maintenance experiments infrastructure -All with common data engineering stacks & open source tooling.

ABOUT THE SPEAKER: Vino Duraisamy is a developer advocate at lakeFS, an open-source platform that delivers git-like experience to object store based data lakes. She has previously worked at NetApp (on data management applications for NetApp data centers), on data teams of Nike and Apple, where she worked mainly on batch processing workloads as a data engineer, built custom NLP models as an ML engineer and even touched upon MLOps a bit for model deployments.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.

Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data related topics including data infrastructure, data engineering, ML systems, analytics and AI from top startups and tech companies.

FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai/

Doing Software Engineering in Academia - Johanna Bayer

2023-01-13 · DataTalks.Club Listen

podcast_episode

by Johanna Bayer

Data Engineering GitHub HTML

We talked about:

Johanna’s background Open science course and reproducible papers Research software engineering Convincing a professor to work on software instead of papers The importance of reproducible analysis Why academia is behind on software engineering The problems with open science publishing in academia The importance of standard coding practices How Johanna got into research software engineering Effective ways of learning software engineering skills Providing data and analysis for your project Johanna’s initial experience with software engineering in a project Working with sensitive data and the nuances of publishing it How often Johanna does hackathons, open source, and freelancing Social media as a source of repos and Johanna’s favorite communities Contributing to Git repos Publishing in the open in academia vs industry Johanna’s book and resource recommendations Conclusion

Links:

The Society of Research Software Engineering, plus regional chapters: https://society-rse.org/ The RSE Association of Australia and New Zealand: https://rse-aunz.github.io/ Research Software Engineers (RSEs) The people behind research software: https://de-rse.org/en/index.html The software sustainability institute: https://www.software.ac.uk/ The Carpentries (beginner git and programming courses): https://carpentries.org/ The Turing Way Book of Reproducible Research: https://the-turing-way.netlify.app/welcome

Free data engineering course: https://github.com/DataTalksClub/data-engineering-zoomcamp

Join DataTalks.Club: https://datatalks.club/slack.html

Our events: https://datatalks.club/events.html

How Preset Integrates dbt with Apache Superset to Deliver on Headless BI & Surface Metrics

2022-10-25 · dbt Coalesce 2022 Watch

video

BI dbt GitHub Superset

At Preset, we offer a managed service for Apache Superset, the most popular open source business intelligence platform (by Github stars) in the world. We believe the future of BI is not only rooted in open source but also adopts the best ideas from the software development life cycle. To that end, we've created a workflow that enables you to manage Superset datasets, charts, and dashboards as code and we integrated dbt into our platform. In this talk, I'll showcase the speed and change management benefits that are enabled by this workflow of managing core BI assets using dbt and version control.

Check the slides here: https://docs.google.com/presentation/d/1SjbXOgJnuAnmu3B3cY1YAEOMZdARH72Siwneq2yRjfU/edit?usp=sharing

Coalesce 2023 is coming! Register for free at https://coalesce.getdbt.com/.

When analysts outnumber engineers 5 to 1: Our journey with dbt at M1

2022-10-25 · dbt Coalesce 2022 Watch

video

by Kelly Wachtel (M1)

Analytics Data Engineering dbt Kubernetes SQL Tableau Terraform

How do you train and enable 20 data analysts to use dbt Core in a short amount of time?

At M1, engineering and analytics are far apart on the org chart, but work hand-in-hand every day. M1 engineering has a culture that celebrates open source, where every data engineer is trained and empowered to work all the way down the infrastructure stack, using tools like Terraform and Kubernetes. The analytics team is comprised of strong SQL writers who use Tableau to create visualizations used company wide. When M1 knew they needed a tool like dbt for change management and data documentation generation, they had to figure out how to bridge the gap between engineering and analytics to enable analysts to contribute with minimal engineering intervention. Join Kelly Wachtel, a senior data engineer at M1, explain how they trained about 20 analysts to use git and dbt Core over the past year, and strengthened their collaboration between their data engineering and analytics teams.

Check the slides here: https://docs.google.com/presentation/d/1CWI97EMyLIz6tptLPKt4VuMjJzV_X3oO/edit?usp=sharing&ouid=110293204340061069659&rtpof=true&sd=true

Coalesce 2023 is coming! Register for free at https://coalesce.getdbt.com/.

The Book of Dash

2022-10-25 · O'Reilly Data Science Books O'Reilly Amazon

book

by Christian Mayer , Adam Schroeder , Ann Marie Ward

AI/ML Dashboard DataViz Pandas Plotly Python dashboards data data-science data-science-tasks data-visualization

A swift and practical introduction to building interactive data visualization apps in Python, known as dashboards. Youâ??ve seen dashboards before; think election result visualizations you can update in real time, or population maps you can filter by demographic. With the Python Dash library youâ??ll create analytic dashboards that present data in effective, usable, elegant ways in just a few lines of code. The book is fast-paced and caters to those entirely new to dashboards. It will talk you through the necessary software, then get straight into building the dashboards themselves. Youâ??ll learn the basic format of a Dash app by building a twitter analysis dashboard that maps the number of likes certain accounts gained over time. Youâ??ll build up skills through three more sophisticated projects. The first is a global analysis app that compares country data in three areas: the percentage of a population using the internet, percentage of parliament seats held by women, and CO2 emissions. Youâ??ll then build an investment portfolio dashboard, and an app that allows you to visualize and explore machine learning algorithms. In this book you will: â?¢Create and run your first Dash apps â?¢Use the pandas library to manipulate and analyze social media data â?¢Use Git to download and build on existing apps written by the pros â?¢Visualize machine learning models in your apps â?¢Create and manipulate statistical and scientific charts and maps using Plotly Dash combines several technologies to get you building dashboards quickly and efficiently. This book will do the same.

talk-data.com

Activity Trend

Top Events

Top Speakers

How Panasonic Leverages Airflow

#51 Is Data Science a Lonely Profession?

Building Telemetry Curations and Effective Data Pipelines

From Colab to Artifact Registry: Building python packages for production

Data As Code Using lakeFS Open Source

Open Source Nessie: Enabling DataOps, Catalog Versioning and Git for Data

YAML all the way down: State machines and CI/CD for Google Apigee

Fundamentals of Analytics Engineering

Version Your Data Lakehouse Like Your Software With Nessie

Cracking the Data Science Interview

Git reflog : ma commande (sous-côtée) préférée de Git

On the benefits and virtues of drilling pilot holes - Coalesce 2023

Powering Vector Search With Real Time And Incremental Vector Indexes

Building Linked Data Products With JSON-LD

The Future is Open: Data Streaming in an Omni-Cloud Reality

Building an ML Experimentation Platform for Easy Reproducibility | Treeverse

Doing Software Engineering in Academia - Johanna Bayer

How Preset Integrates dbt with Apache Superset to Deliver on Headless BI & Surface Metrics

When analysts outnumber engineers 5 to 1: Our journey with dbt at M1

The Book of Dash