talk-data.com
People (46 results)
Dr. Katrina Riehl
President of the Board of Directors at NumFOCUS; Head of the Streamlit Data Team at Snowflake; Adjunct Lecturer at Georgetown University · NumFOCUS; Snowflake; Georgetown University
Harel Shein
Sr. Eng. Manager, OpenLineage TSC · Datadog
Scott Mabe
Technical Advocate · Datadog
Companies (1 result)
Activities & events
| Title & Speakers | Event |
|---|---|
|
AWS x Datadog re:Invent Recap
2026-01-29 · 16:45
Welcome to the first Datadog User Group Meetup of the year! We're kicking things off with an AWS re:Invent Recap on Thursday, January 29th, bringing the community together to share insights, highlights, and practical takeaways from the conference. This year's agenda also puts a strong spotlight on AI and how it is rapidly reshaping DevOps, operations, and on-call workflows, setting the stage for an exciting year ahead. That's also why we've invited Prosus to the stage: a leader in the AI field, it focuses heavily on integrating artificial intelligence into its global e-commerce and lifestyle brands and invests in the broader AI ecosystem. Looking forward to seeing you there and starting the year strong as a local Datadog & AWS community. Agenda 17:30 Walk-in, food and drinks (vegetarian options available) 18:00 Opening by Datadog User Group Organizers --- 18:10 Talk 1: AWS DevOps Agent, drive operational excellence with a frontier agent that resolves and proactively prevents incidents, by Ioannis Moustakis (Solutions Architect Lead, AWS) 18:50 Talk 2: TBD by NAME (TITLE, Prosus) 19:30 Quick Break 19:40 Talk 3: Introducing Bits AI SRE, your AI on-call teammate, by Ryan Earley (Sr. Enterprise Sales Engineer, Datadog) --- 20:20 Drinks and socializing 21:00 - 21:15 End Speaker Bios: 1. Ioannis Moustakis, Solutions Architecture Lead (AWS): Leading a team of Solutions Architects across 11 countries in Northern Europe, supporting mid-market companies and scale-ups to accelerate cloud and AI adoption, modernization, and measurable business outcomes on AWS. 2. NAME, Title (Prosus): Speaker bio TBD. 3. Ryan Earley, Sr. Enterprise Sales Engineer (Datadog): Ryan has been with Datadog for over 4.5 years, supporting Enterprise organizations in EMEA with the evaluation and adoption of AI Observability and Security practices.
|
AWS x Datadog re:Invent Recap
|
|
Yotam Bentov on Error Detecting and Error Correcting Codes
2024-10-29 · 22:30
We're pleased to present Yotam Bentov on Error Detecting and Error Correcting Codes (Read the paper) Bell Labs in the late 40s and 50s was the cradle in which modern computing was conceived. From the invention of the transistor to the conception of information theory, the intellectual environment of Bell Labs was critical in creating our present-day technological world. Richard W. Hamming was steeped in this atmosphere when he came up with a still-used technique for detecting and correcting errors in transmitted data. The paper introduces a technique for identifying the location of an error inside a transmitted "code word," then proves that this technique yields the "most efficient" possible coding scheme. We will use this paper both as a departure point for understanding error detection and correction and as a window into the environment in which Hamming was working. We will close by ruminating on what the state of computing in 1950 can tell us about the future of computing more broadly. Yotam Bentov is an engineering manager for Tulip's Platform team, where he works on distributed systems, databases, and product type systems. He gets particularly excited about the history of computing, home cooking, and bicycles. --- ⚠️ Required: You must have your real name on your Meetup account and provide a photo ID at the entrance to attend, per the venue rules. If you are not on the list, you will not be admitted. 🚔 Reminder: Papers We Love has a code of conduct. Breaching the CoC is grounds to be ejected from the meetup at the organizers' discretion. 📹 The event will be recorded and made available 1-2 weeks afterwards. 💬 Join us on the Papers We Love Discord - https://discord.gg/6gupsBg4qp Venue: Datadog 620 8th Ave, 45th Floor New York, NY 10018 USA Doors open at 6:30pm EST Note: Enter the building on 8th Avenue and go to the reception desk on your left |
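For readers who want the core idea in hand before the talk, here is a hedged, minimal Python sketch of the textbook Hamming(7,4) scheme the paper introduces; it is an illustration of the technique, not code from the talk.

```python
# Hamming(7,4): three parity bits at positions 1, 2, and 4 protect four
# data bits so the receiver can locate (and flip) any single-bit error.

def encode(d: list[int]) -> list[int]:
    """[d1, d2, d3, d4] -> 7-bit code word, positions 1..7."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def correct(word: list[int]) -> list[int]:
    """XOR the indices of all set bits; the syndrome names the bad position."""
    syndrome = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= pos
    if syndrome:  # a zero syndrome means the word arrived clean
        word[syndrome - 1] ^= 1
    return word

sent = encode([1, 0, 1, 1])
received = sent.copy()
received[4] ^= 1  # a transmission error flips position 5
assert correct(received) == sent
```

The detail worth savoring is the syndrome trick: because each parity bit covers exactly the positions whose binary index contains its bit, XOR-ing the indices of the set bits spells out the error's position directly.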
Yotam Bentov on Error Detecting and Error Correcting Codes
|
|
Women Do Tech Too Conference
2024-10-02 · 12:00
Women Do Tech Too celebrates the vibrant and diverse women's tech community. It is driven by the belief that diversity and inclusion are paramount in fostering innovation and driving progress in the tech sector. This platform is not only for women but also welcomes participation from all individuals who share a passion for championing diversity and equity in the tech community. Seating and security: We will have limited seats! Please remember this when replying to the RSVP and update your response if you cannot attend. Agenda: 2 PM - Welcoming notes: Self-doubt and career progression by Cécile Chateau, Engineering Program Manager Director 2:15 PM - From Idea to Action: WomenInTech, a safe place to find inspiration, knowledge, and role models by Alejandra Paredes, Software Developer Engineer, and Estelle Thou, Software Developer Engineer There will be three main consecutive tracks. Each track features three presentations followed by a Q&A session. 3 PM - Professional retraining, Successes & failures, Recognition at work Ensure a future of collaboration and diversity in the Tech Industry by Clara Philippot, Ada Tech School Paris Campus director “How are we training the new generation of developers to learn and iterate from collaboration, agile methodology and empathy?” The path of staff engineer by Paola Ducolin, Staff Software Engineer (Datadog) “Earlier this year, I was promoted to Staff Engineer at my current company, Datadog. It was a three-year-long path. In this lightning talk, I will share the journey with its ups and downs.” EPM: Product or Engineering? by Agnès Masson-Sibut, Engineering Program Manager Have you ever wondered what EPM means and what we do? Whether we are mostly part of the Product organisation or the Engineering organisation? Hopefully, everything will be clearer after this talk. ☕ 4 PM - Coffee break 4:15 PM - Privacy and Security Cookies 101 by Julie Chevrier, Software Developer Engineer “Have you ever wondered what happens after you click on a cookie consent banner and how your choice affects the ads you see? Join me to understand what exactly a cookie is and how it is used for advertising!” How to make recommendations in a world without 3rd party cookies by Lucie Mader, Senior Machine Learning Engineer “Depending on the browser you're using and the website you're visiting, the products in the ads you see might seem strange. We'll discuss this issue and its possible relationship to third-party cookies in this talk.” Privacy in the age of Generative AI by Jaspreet Sandhu, Senior Machine Learning Engineer "With the advent and widespread integration of Generative AI across applications, industrial or personal, how do we prevent misuse and ensure data privacy, security, and ethical use? This talk delves into the challenges and strategies for safeguarding sensitive information and maintaining user trust in the evolving landscape of AI-driven technologies." 🚻 5:15 PM - Break 5:30 PM - User experience How to translate women’s empowerment into a brand visual identity by Camille Lannel-Lamotte, UI Designer Uncover how color theory, symbolism, and language come together to shape the new brand image and get an insider’s view of the key elements that define it.
From Vision to Experience: The Product Manager's Journey in Shaping User-Centric Products by Salma, Senior Product Manager “Evolution of product managers' roles in creating user-centric products, transitioning from initial vision to crafting meaningful user experiences.” Crafting Consistency: Integrating a new theme in Criteo’s React Design System by Claire Dochez, Software Developer Engineer Last year, our team integrated a new theme into Criteo’s design system. This talk will cover the journey, emphasizing the key steps, challenges faced, and lessons learned along the way. 👋 6:30 PM - Closing notes Have a break and find YOUR own balance with the Wheel of Life! by Sandrine Planchon, Human-Minds - Coach in mental health prevention & Creator of disconnecting experiences When everything keeps getting faster, to the point of sometimes throwing you off balance, what about slowing down for a moment and reflecting on YOUR own need for balance in your life? The Wheel of Life can show a way to access it! 🍸 7 PM - Rooftop Cocktail (weather permitting) If you register for this event, you consent to CRITEO's use of your image, video, voice, or all three. In addition, you waive any right to inspect or approve the finished video recording. You agree that any such image, video, or audio recording and any reproduction thereof shall remain the property of the author and may be used by Criteo as it sees fit. You understand that this consent is perpetual, cannot be revoked by you, and is binding. You understand that these images may appear publicly on Criteo's website, social media accounts, and/or other marketing materials. |
Women Do Tech Too Conference
|
|
Life after migration: Day-2 operations with the LGTM Stack at Trade Republic
2024-07-11 · 19:15
Nikolay Andreev
– Platform Engineer
@ Trade Republic
Half a year ago, my team at Trade Republic fully migrated our observability stack from Datadog to LGTM (Loki, Grafana, Tempo, Mimir). Operations after migration are as important as the migration itself, involving ongoing challenges such as performance and scalability issues, bugs, and incidents. In this talk, I’ll share our experiences from the past six months, detailing the challenges we faced and the valuable lessons we learned while using Grafana tools. Join us to gain insights into the practical aspects of managing and optimising an observability stack in a dynamic environment. |
Grafana & Friends Berlin 🏝️ Summer Meetup at Trade Republic
|
|
Washington DC area dbt Meetup
2024-06-03 · 21:30
dbt Meetups are networking events open to all folks working with data! Talks predominantly focus on community members' experience with dbt, but presentations can cover broader topics such as analytics engineering, data stacks, data ops, modeling, testing, and team structures. 🤝Organizer: dbt Labs 🏠Venue Host: Excella, 2300 Wilson Boulevard, Suite 600, Arlington, VA 22201 🍕Catering: Light bites & beverages To attend, please read the Health and Safety Policy and Terms of Participation: https://www.getdbt.com/legal/health-and-safety-policy 🗣️ Presentations:
Event details: The doors open at 5:30pm for check-in and networking. Presentations begin at 6:00 and there will be additional networking time until 7:30. Refreshments will be provided. Directions: The Excella office is located near the Court House Metro station on the Orange line. Parking on premise at 2300 Wilson Blvd is extremely limited. If you are unable to park in the building's garage or on the street near the building, we suggest the following parking garages, which are within close walking distance:
➡️ Join the dbt Slack community: https://www.getdbt.com/community/ 🤝For the best Meetup experience, make sure to join the #local-dcmetro channel in dbt Slack (https://slack.getdbt.com/). ---------------------------------- dbt is a data transformation framework that lets analysts and engineers collaborate using their shared knowledge of SQL. Through the application of software engineering best practices like modularity, version control, testing, and documentation, dbt’s analytics engineering workflow helps teams work more efficiently to produce data the entire organization can trust. Learn more: https://www.getdbt.com/ |
Washington DC area dbt Meetup
|
|
Frontend Focus: Exploring AI and Innovation
2024-05-29 · 16:30
Dear members of the Criteo community, join us as we dive into the latest research and advancements at the intersection of artificial intelligence (AI), user interface (UI), and system design. Our amazing speakers will illuminate the evolving landscape where AI reshapes how we interact with front-end technologies. Welcome to an evening of exploration and insights into the future of AI-driven design! Agenda 6:30 PM - Welcome 7:00 PM - Introduction by Sam Lee, Frontend Tech Lead @Criteo Sam will kick off the event by sharing insights and discussing current front-end topics at Criteo. 7:15 PM - "Draw a Dashboard" by Daniel Flores, Software Development Engineer @Criteo Don't miss Daniel's captivating presentation on generating dashboards with AI. With the latest vision models and related AI-powered techniques, it’s now even easier to add object detection and pattern recognition features to different products; experiments like v0.dev and draw-a-ui are good examples showing that the same approach can now be applied to UI generation. Daniel builds dashboards at Criteo; imagine being able to generate them by just drawing! It would save time, especially when bootstrapping new configurations, and it could be a cool feature to let users configure even more custom reports. In this talk, he explores the results of his experimentation with UI generation, using Criteo's own design system and data visualization framework to produce dashboards from simple drawings. 7:45 PM - Second talk: "Meet your new AI best friend: 🦜🔗LangChain" by Henry Lagarde from Datadog. LangChain 🦜🔗 is much more than just a framework for developing language-model-powered applications. It's a versatile tool that allows you to interact with your APIs or databases using your everyday language. We will explore together how, with just a few lines of code, you can interact with any LLM API and create your own unique AI-powered tool (a toy sketch appears after this listing). During the presentation, we will cover:
8:15 PM - Rooftop Cocktail (weather permitting) If weather permits, join us for a refreshing cocktail on the rooftop. Enjoy the panoramic view of the city while continuing conversations and exchanging ideas with fellow participants. The event is open to all technology enthusiasts. Registration is free, but places are limited. Book yours now to guarantee your attendance on this unique day. Don't miss this opportunity to dive into the future of Generative AI with Criteo. Together, we're shaping the face of technological innovation. Important information for the day of the meetup: Those registered for the event and not present 5 minutes before the start of the session, i.e. at 6:55 p.m., will have their places made available. Those not registered or on the waitlist will, therefore, only be able to attend if there are places available on site 5 minutes before the start of the session, i.e., at 6:55 p.m. The Criteotech meetups team. If you register for this event, you consent to CRITEO's use of your image, video, voice, or all three. In addition, you waive any right to inspect or approve the finished video recording. You agree that any such image, video, or audio recording and any reproduction thereof shall remain the property of the author and may be used by Criteo as it sees fit. You understand that this consent is perpetual, cannot be revoked by you, and is binding. You understand that these images may appear publicly on Criteo's website, social media accounts, and/or other marketing materials. |
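As a hedged taste of the "few lines of code" claim in the LangChain talk description (assuming the langchain-openai and langchain-core packages and an OPENAI_API_KEY in the environment; the prompt and model are illustrative, not the speaker's actual demo):

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Any chat model works here; gpt-4o-mini is just a placeholder choice.
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template(
    "Translate this API error into plain English: {error}"
)
chain = prompt | llm  # LCEL: pipe the prompt template into the model
print(chain.invoke({"error": "HTTP 429: rate limit exceeded"}).content)
```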
Frontend Focus: Exploring AI and Innovation
|
|
Kaiser Fung: Exploring advanced histograms
2024-03-14 · 21:45
Abstract: The histogram is a fundamental statistical chart. The simplest histogram is easy to make and interpret. More advanced variations of the histogram pose surprising challenges. I will cover insights from my recent exploration of varying-width histograms, which revealed gaps in my own understanding of this deceptively simple chart form. Bio: Kaiser is the creator of Junk Charts, a leading blog on data visualization, as well as the author of two bestsellers on statistical thinking, Numbers Rule Your World and Numbersense. His commentary on statistics and data visualization has been featured in Harvard Business Review, The Daily Beast, American Scientist, Wired, FiveThirtyEight, Slate, Financial Times, and CNN. He was the founding director of the Master of Science in Applied Analytics at Columbia University. He leads the data science team at VERSES, a cognitive computing startup. This event will be hosted at the Datadog NYC office (45th floor), with refreshments provided. Doors open at 5:45. Attendees are asked to respect the meetup's Code of Conduct. |
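One concrete instance of the "surprising challenges" the abstract alludes to: with varying-width bins, plotting raw counts and plotting densities tell different stories, because only the density view makes areas comparable. A small matplotlib sketch with invented data, not an example from the talk:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=10_000)
bins = [0, 1, 2, 4, 8, 16]  # deliberately varying widths

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=bins)  # raw counts: wide bins look inflated
ax1.set_title("raw counts")
# density=True plots count / (total * bin_width), so area, not height,
# carries the meaning and unequal bins become comparable.
ax2.hist(data, bins=bins, density=True)
ax2.set_title("density (total area = 1)")
plt.show()
```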
Kaiser Fung: Exploring advanced histograms
|
|
PyLadies Paris Python Talks
2023-11-16 · 17:30
Dear PyLadies 💚🐍 Our next on-site event is coming on 16th November, featuring three excellent speakers: 🌟 Anne-Marie Tousch (Datadog), talk title: Why am I doing this??? 🌟 Alina Tuholukova (GitGuardian), talk title: No downtime migrations in Django 🌟 Sarah Diot-Girard (Owkin), talk title: The science of debugging 🌟 Agenda (preliminary) 18h30 - 18h45 Come and take your seat 18h45 - 19h00 Welcome by PyLadies Paris and GitGuardian 19h00 - 19h30 Talk by Anne-Marie Tousch 19h30 - 20h00 Talk by Alina Tuholukova 20h00 - 20h30 Talk by Sarah Diot-Girard 20h30 - 21h30 Cocktail, networking NOTE: If you want to join but cannot be with us onsite, we will broadcast the event online! Here is the registration link: https://app.livestorm.co/gitguardian/pyladies-paris-python-talks 🌟 Alina Tuholukova (GitGuardian) Talk title: No downtime migrations in Django Abstract: Database migrations are a critical aspect of maintaining and evolving Django applications. However, they often pose a significant challenge when it comes to maintaining uninterrupted service and ensuring a seamless user experience. Unfortunately, the Django framework does not natively provide a solution for no-downtime migrations. In this talk, we will discuss the approach that GitGuardian took to ensure continuous service for its users (a minimal sketch of one common pattern appears after this listing). About Alina: She began her career in research, where she spent a couple of years working on networking subjects. Later on, she switched to software development and has worked for a few different companies, mainly on backends in C++, Scala, and Python. For a little over a year now, she has been part of the team at GitGuardian, a company that develops a code security platform and is a leader in detecting secrets within code. This role has exposed her to some challenging topics and provided many opportunities for learning. Currently, she and her team are actively involved in the development of honeytokens, a tool designed to detect whether an attacker has gained access to your code. 🌟 Sarah Diot-Girard (Owkin) Talk title: The science of debugging Abstract: Debugging might be the most universal experience shared by anyone who writes code. Nevertheless, it is often a frustrating experience, perceived as abstruse and time-wasting, and one where you have to come up with all the ideas. It does not have to be that way. This talk will focus on methods to help make debugging a rational, positive experience, and we will explore how debugging can even help with gaining valuable knowledge about your codebase. About Sarah: Sarah Diot-Girard has been working on Machine Learning since 2012, and she enjoys using data science tools to find solutions to practical problems. She is particularly interested in issues, both technical and ethical, that come from applying ML in real life. She has given talks at international conferences about data privacy, algorithmic fairness, and software engineering best practices applied to data science. Since 2023 she has been employed by Owkin as a maintainer of the Federated Learning platform Substra. 🌟 Anne-Marie Tousch (Datadog) Talk title: Why am I doing this??? Abstract: How often do you ask yourself this question? In this talk, I’ll use it as a guide and walk you through a few interesting problems that we have at Datadog around anomaly detection in time series.
We’ll see how this questioning can help us improve our understanding of a variety of topics, such as when to use machine learning, how to select the best algorithm for a problem, when to publish a paper, or how to build useful products. About Anne-Marie: She is a Senior Data Scientist at Datadog, based in Paris. For some reason, she started working with machine learning in 2006. Before joining Datadog, she worked for 5+ years as a machine learning researcher at Criteo, and before that for 4+ years as a computer vision engineer in a startup. She holds a PhD in computer vision from the Ecole des Ponts ParisTech (2010). In the past few years at Datadog, she's been working on log anomaly detection, and more recently on general time series anomaly detection for observability. **GitGuardian** will be our host and sponsor of the food and drinks during the networking session after the talks: thank you 💚 Important info: 1. ❗For safety reasons, the venue's staff will check everyone's identity on site. 📝 Please remember to bring an ID with you and register for the event with your real name and family name. Thank you! 2. Please be on time; we can’t guarantee a seat once the meetup has started. 🔍 FAQ Q. I'm not female, is it ok for me to attend? A. Yes, PyLadies Paris events are open to everyone at all levels. |
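The sketch referenced in Alina's abstract above: one widely used no-downtime pattern is to add the column as nullable, backfill in batches, and only enforce the constraint in a later deploy. App, model, and field names here are hypothetical, and this is a generic illustration rather than GitGuardian's actual approach.

```python
from django.db import migrations, models


def backfill_status(apps, schema_editor):
    # Historical model lookup; "shop"/"Order"/"status" are hypothetical.
    Order = apps.get_model("shop", "Order")
    while True:
        batch = list(
            Order.objects.filter(status__isnull=True)
            .values_list("pk", flat=True)[:1000]
        )
        if not batch:
            break
        Order.objects.filter(pk__in=batch).update(status="pending")


class Migration(migrations.Migration):
    atomic = False  # batched backfill should not run in one huge transaction
    dependencies = [("shop", "0007_previous")]
    operations = [
        # Step 1: nullable column, so code deployed before this migration
        # keeps inserting rows without error.
        migrations.AddField(
            model_name="order",
            name="status",
            field=models.CharField(max_length=20, null=True),
        ),
        migrations.RunPython(backfill_status, migrations.RunPython.noop),
        # Step 2 lives in a *later* migration, after all rows are filled:
        # migrations.AlterField(..., null=False, default="pending")
    ]
```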
PyLadies Paris Python Talks
|
|
New York dbt meetup (IN-PERSON)
2023-11-02 · 22:30
😎It's here! Our next dbt meetup is coming up and will be hosted at the Sigma NYC office:
We are looking forward to socializing and seeing all of the new and familiar faces! 💡dbt Meetups are networking events open to all folks working with data! Talks predominantly focus on community members' experience with dbt; however, you'll catch presentations on broader topics such as analytics engineering, data stacks, data ops, modeling, testing, and team structures. Our venue has capacity limits, so please only RSVP if you intend to come and reach out to [email protected] if you need to cancel last minute or change your RSVP status on the Meetup to "Not Going." ➡️ Join the dbt Slack community: https://www.getdbt.com/community/ 🤝For the best Meetup experience, make sure to join the #local-nyc channel in dbt Slack (https://slack.getdbt.com/) dbt is a data transformation framework that lets analysts and engineers collaborate using their shared knowledge of SQL. Through the application of software engineering best practices like modularity, version control, testing, and documentation, dbt’s analytics engineering workflow helps teams work more efficiently to produce data the entire organization can trust. Learn more: https://www.getdbt.com/ *** 🏠Venue Host: Sigma Computing 🍕Catering: TBD 🤝Organizer: Brooklyn Data Co. 📣 Agenda and additional details below:
✨Speakers:
To attend, please read the Required Participation Language for In-Person Events with dbt Labs: https://bit.ly/3QIJXFb |
New York dbt meetup (IN-PERSON)
|
|
Moving Machine Learning Into The Data Pipeline at Cherre
2021-04-20 · 02:00
Tal Galfsky
– guest
@ Cherre
,
Tobias Macey
– host
Summary Most of the time when you think about a data pipeline or ETL job what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those transformations is actually a full-fledged machine learning project in its own right. In this episode Tal Galfsky explains how he and the team at Cherre tackled the problem of messy data for Addresses by building a natural language processing and entity resolution system that is served as an API to the rest of their pipelines. He discusses the myriad ways that addresses are incomplete, poorly formed, and just plain wrong, why it was a big enough pain point to invest in building an industrial strength solution for it, and how it actually works under the hood. After listening to this you’ll look at your data pipelines in a new light and start to wonder how you can bring more advanced strategies into the cleaning and transformation process. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask. RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. Your host is Tobias Macey and today I’m interviewing Tal Galfsky about how Cherre is bringing order to the messy problem of physical addresses and entity resolution in their data pipelines. 
Interview Introduction How did you get involved in the area of data management? Started as a physicist and evolved into data science. Can you start by giving a brief recap of what Cherre is and the types of data that you deal with? Cherre is a company that connects data. We’re not a data vendor, in that we don’t sell data, primarily. We help companies connect and make sense of their data. The real estate market is historically closed, gut-led, and behind on tech. What are the biggest challenges that you deal with in your role when working with real estate data? Lack of a standard domain model in real estate. Ontology. What is a property? Each data source thinks about properties in a very different way, therefore yielding similar but completely different data. QUALITY (even if the datasets are talking about the same thing, there are different levels of accuracy and freshness). HIERARCHY: when is one source better than another? What are the teams and systems that rely on address information? Any company that needs to clean or organize (make sense of) their data needs to identify people, companies, and properties. Our clients use address resolution in multiple ways, via the UI or via an API. Our service is both external and internal, so what I build has to be good enough for the demanding needs of our data science team, robust enough for our engineers, and simple enough that non-expert clients can use it. Can you give an example of the problems involved in entity resolution? Known entity example: the Empire State Building. To resolve addresses in a way that makes sense for the client you need to capture the real-world entities: lots, buildings, units. Identify the type of the object (lot, building, unit). Tag the object with all the relevant addresses. Relations to other objects (lot, building, unit). What are some examples of the kinds of edge cases or messiness that you encounter in addresses? The first class is string problems, the second class is component problems, and the third class is geocoding. I understand that you have developed a service for normalizing addresses and performing entity resolution to provide canonical references for downstream analyses. Can you give an overview of what is involved? What is the need for the service? The main requirement here is connecting an address to lot, building, and unit with latitude and longitude coordinates. How were you satisfying this requirement previously? Before we built our model and dedicated service we had a basic prototype pipeline to handle NYC addresses only. What were the motivations for designing and implementing this as a service? The need to expand nationwide and to deal with client queries in real time. What are some of the other data sources that you rely on to be able to perform this normalization and resolution? Lot data, building data, unit data, footprints, and address-points datasets. What challenges do you face in managing these other sources of information? Accuracy, hierarchy, standardization, a unified solution, persistent IDs and primary keys. Digging into the specifics of your solution, can you talk through the full lifecycle of a request to resolve an address and the various manipulations that are performed on it? String cleaning, parse and tokenize, standardize, match (a toy sketch of these stages appears after this listing). What are some of the other pieces of information in your system that you would like to see addressed in a similar fashion? Our named entity solution with connection to the knowledge graph and owner unmasking.
What are some of the most interesting, unexpected, or challenging lessons that you learned while building this address resolution system? Scaling the NYC geocode example: the NYC model was exploding a subset of the options for messing up an address. Flexibility. Dependencies. Client exposure. Now that you have this system running in production, if you were to start over today what would you do differently? A lot, but at this point the module boundaries and client interface are defined in such a way that we are able to make changes to or completely replace any given part of it without breaking anything client-facing. What are some of the other projects that you are excited to work on going forward? Named entity resolution and the Knowledge Graph. Contact Info Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? BigQuery is a huge asset, and in particular UDFs, but they don’t support API calls or Python scripts. Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.init, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Links Cherre Podcast Episode Photonics Knowledge Graph Entity Resolution BigQuery NLP == Natural Language Processing dbt Podcast Episode Airflow Podcast.init Episode Datadog Podcast Episode The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast |
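To make the clean, parse/tokenize, standardize, match lifecycle above concrete, here is a toy Python sketch; the abbreviation table, index shape, and returned IDs are invented for illustration and are not Cherre's service.

```python
import re

SUFFIXES = {"st": "street", "ave": "avenue", "blvd": "boulevard"}

def clean(raw: str) -> str:
    """Strip punctuation and case: the 'string problems' class."""
    return re.sub(r"[^\w\s]", " ", raw).lower().strip()

def standardize(tokens: list[str]) -> str:
    """Expand abbreviations: the 'component problems' class."""
    return " ".join(SUFFIXES.get(t, t) for t in tokens)

def resolve(raw: str, index: dict[str, dict]) -> dict | None:
    """Clean -> tokenize -> standardize -> match against canonical entities."""
    return index.get(standardize(clean(raw).split()))

# A one-entry stand-in for the canonical lot/building/unit index.
index = {"350 5th avenue": {"type": "building", "id": "B-0001"}}
print(resolve("350 5th Ave.", index))  # -> {'type': 'building', 'id': 'B-0001'}
```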
|
|
Using Your Data Warehouse As The Source Of Truth For Customer Data With Hightouch
2021-01-19 · 02:00
Tejas Manohar
– guest
@ Hightouch
,
Tobias Macey
– host
Summary The data warehouse has become the central component of the modern data stack. Building on this pattern, the team at Hightouch have created a platform that synchronizes information about your customers out to third party systems for use by marketing and sales teams. In this episode Tejas Manohar explains the benefits of sourcing customer data from one location for all of your organization to use, the technical challenges of synchronizing the data to external systems with varying APIs, and the workflow for enabling self-service access to your customer data by your marketing teams. This is an interesting conversation about the importance of the data warehouse and how it can be used beyond just internal analytics. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask. This episode of Data Engineering Podcast is sponsored by Datadog, a unified monitoring and analytics platform built for developers, IT operations teams, and businesses in the cloud age. Datadog provides customizable dashboards, log management, and machine-learning-based alerts in one fully-integrated platform so you can seamlessly navigate, pinpoint, and resolve performance issues in context. Monitor all your databases, cloud services, containers, and serverless functions in one place with Datadog’s 400+ vendor-backed integrations. If an outage occurs, Datadog provides seamless navigation between your logs, infrastructure metrics, and application traces in just a few clicks to minimize downtime. Try it yourself today by starting a free 14-day trial and receive a Datadog t-shirt after installing the agent. Go to dataengineeringpodcast.com/datadog today to see how you can enhance visibility into your stack with Datadog. 
Your host is Tobias Macey and today I’m interviewing Tejas Manohar about Hightouch, a data platform that helps you sync your customer data from your data warehouse to your CRM, marketing, and support tools Interview Introduction How did you get involved in the area of data management? Can you start by giving an overview of what you are building at Hightouch and your motivation for creating it? What are the main points of friction for teams who are trying to make use of customer data? Where is Hightouch positioned in the ecosystem of customer data tools such as Segment, Mixpanel |
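In sketch form, the warehouse-first sync Tejas describes reduces to a query-then-upsert loop from the warehouse out to a SaaS API. A hedged illustration (the cursor, table, endpoint, and payload shape are placeholders, not Hightouch's implementation):

```python
import requests

def sync_audience(cursor, crm_url: str, api_key: str) -> None:
    """Pull a modeled audience from the warehouse and upsert it into a CRM."""
    cursor.execute(
        "SELECT email, lifetime_value FROM analytics.high_value_users"
    )
    for email, ltv in cursor.fetchall():
        # PATCH keyed by email keeps the sync idempotent across runs.
        resp = requests.patch(
            f"{crm_url}/contacts/{email}",
            json={"lifetime_value": ltv},
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=10,
        )
        resp.raise_for_status()
```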
|
|
Kevin Stumpf
– guest
@ Tecton
,
Tobias Macey
– host
Summary As more organizations are gaining experience with data management and incorporating analytics into their decision making, their next move is to adopt machine learning. In order to make those efforts sustainable, the core capability they need is for data scientists and analysts to be able to build and deploy features in a self-service manner. As a result, the feature store is becoming a required piece of the data platform. To fill that need, Kevin Stumpf and the team at Tecton are building an enterprise feature store as a service. In this episode he explains how his experience building the Michelangelo platform at Uber has informed the design and architecture of Tecton, how it integrates with your existing data systems, and the elements that are required for a well-engineered feature store. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took, compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show. You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!
Your host is Tobias Macey and today I’m interviewing Kevin Stumpf about Tecton and the role that the feature store plays in a modern MLOps platform Interview Introduction How did you get involved in the area of data management? Can you start by describing what you are building at Tecton and your motivation for starting the business? For anyone who isn’t familiar with the concept, what is an example of a feature? How do you define what a feature store is? What role does a feature store play in the overall lifecycle of a machine learning p |
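For the "what is an example of a feature?" question in the interview outline, a minimal pandas sketch may help: a feature is typically a per-entity aggregate computed from raw events and keyed for training and serving lookups. The column and table names here are invented.

```python
import pandas as pd

# Raw events, as they might land from an application database.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "amount": [10.0, 25.0, 5.0],
})

# Two features keyed by user_id: transaction count and total spend.
features = events.groupby("user_id")["amount"].agg(
    txn_count="count",
    total_spend="sum",
)
print(features)
#          txn_count  total_spend
# user_id
# 1                2         35.0
# 2                1          5.0
```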
|
|
Low Friction Data Governance With Immuta
2020-12-21 · 23:00
Summary Data governance is a term that encompasses a wide range of responsibilities, both technical and process-oriented. One of the more complex aspects is that of access control to the data assets that an organization is responsible for managing. The team at Immuta has built a platform that aims to tackle that problem in a flexible and maintainable fashion so that data teams can easily integrate authorization, data masking, and privacy-enhancing technologies into their data infrastructure. In this episode Steve Touw and Stephen Bailey share what they have built at Immuta, how it is implemented, and how it streamlines the workflow for everyone involved in working with sensitive data. If you are starting down the path of implementing a data governance strategy, then this episode will provide a great overview of what is involved. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Feature flagging is a simple concept that enables you to ship faster, test in production, and do easy rollbacks without redeploying code. Teams using feature flags release new software with less risk, and release more often. ConfigCat is a feature flag service that lets you easily add flags to your Python code and to 9 other platforms. By adopting ConfigCat you and your manager can track and toggle your feature flags from their visual dashboard without redeploying any code or configuration, including granular targeting rules. You can roll out new features to a subset of your users for beta testing or canary deployments. With their simple API, clear documentation, and pricing that is independent of your team size, you can get your first feature flags added in minutes without breaking the bank. Go to dataengineeringpodcast.com/configcat today to get 35% off any paid plan with code DATAENGINEERING or try out their free forever plan. You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence.
The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data inf |
|
|
Building A Self Service Data Platform For Alternative Data Analytics At YipitData
2020-12-15 · 01:00
Summary As a data engineer you’re familiar with the process of collecting data from databases, customer data platforms, APIs, etc. At YipitData they rely on a variety of alternative data sources to inform investment decisions by hedge funds and businesses. In this episode Andrew Gross, Bobby Muldoon, and Anup Segu describe the self service data platform that they have built to allow data analysts to own the end-to-end delivery of data projects and how that has allowed them to scale their output. They share the journey that they went through to build a scalable and maintainable system for web scraping, how to make it reliable and resilient to errors, and the lessons that they learned in the process. This was a great conversation about real world experiences in building a successful data-oriented business. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt. 
Your host is Tobias Macey and today I’m interviewing Andrew Gross, Bobby Muldoon, and Anup Segu about how they are building pipelines at YipitData Interview Introduction How did you get involved in the area of data management? Can you start by giving an overview of what YipitData does? What kinds of data sources and data assets are you working with? What is the composition of your data teams and how are they structured? Given the use of your data products in the financial sector, how do you handle monitoring and alerting around data qualit |
|
|
Proven Patterns For Building Successful Data Teams
2020-12-07 · 23:45
Jesse Anderson
– guest
,
Tobias Macey
– host
Summary Building data products is complicated by the fact that there are so many different stakeholders with competing goals and priorities. It is also challenging because of the number of roles and capabilities that are necessary to go from idea to delivery. Different organizations have tried a multitude of organizational strategies to improve the success rate of these data teams with varying levels of success. In this episode Jesse Anderson shares the lessons that he has learned while working with dozens of businesses across industries to determine the team structures and communication styles that have generated the best results. If you are struggling to deliver value from big data, or just starting down the path of building the organizational capacity to turn raw information into valuable products, then this is a conversation that you don’t want to miss. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
Your host is Tobias Macey and today I’m interviewing Jesse Anderson about best practices for organizing and managing data teams Interview Introduction How did you get involved in the area of data management? Can you start by giving an overview of how you view the mission and responsibilities of a data team? What are the critical elements of a successful data team? Beyond the core pillars of data science, data engineering, and operations, what other specialized roles do you find hel |
|
|
Summary Data lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits there are some additional complexities to consider, including how to safely integrate new data sources or test out changes to existing pipelines. In order to address these challenges the team at Treeverse created LakeFS to introduce version control capabilities to your storage layer. In this episode Einat Orr and Oz Katz explain how they implemented branching and merging capabilities for object storage, best practices for how to use versioning primitives to introduce changes to your data lake, how LakeFS is architected, and how you can start using it for your own data platform. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt. Your host is Tobias Macey and today I’m interviewing Einat Orr and Oz Katz about their work at Treeverse on the LakeFS system for versioning your data lakes the same way you version your code. Interview Introduction How did you get involved in the area of data management? Can you start by giving an overview of what LakeFS is and why you built it? There are a number of tools and platforms that support data virtualization and data versioning. How does LakeFS compare to the available options? (e.g. Alluxio, Denodo, Pachyderm, DVC, etc.) What are the primary use cases that LakeFS enables? 
For someone who wants to use LakeFS what is involved in getting it set up? How is LakeFS implemented? How has the design of the system changed or evolved since you began working on it? What assumptions did you have going into it which have since been invalidated or modified? How does the workflow for an engineer or analyst change from working directly against S3 to running against the LakeFS interface? How do you handle merge conflicts and resolution? What |
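To make the branching model concrete: lakeFS exposes an S3-compatible gateway, so an existing S3 client can treat the repository as a bucket and a branch as a key prefix. A hedged boto3 sketch (endpoint, credentials, repository, and branch names are placeholders):

```python
import boto3

# Point a standard S3 client at the lakeFS gateway instead of AWS.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIA-EXAMPLE",        # lakeFS access key
    aws_secret_access_key="EXAMPLE-SECRET",  # lakeFS secret key
)

# Write to an experiment branch without touching main.
s3.put_object(
    Bucket="my-repo",
    Key="experiment-1/tables/events/part-0001.parquet",
    Body=b"...",
)

# main/ still serves the last committed version of the same path.
obj = s3.get_object(
    Bucket="my-repo",
    Key="main/tables/events/part-0001.parquet",
)
```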
|
|
Cloud Native Data Security As Code With Cyral
2020-10-26 · 22:45
Manav Mital
– Founder & CEO
@ Cyral
,
Tobias Macey
– host
Summary One of the most challenging aspects of building a data platform has nothing to do with pipelines and transformations. If you are putting your workflows into production, then you need to consider how you are going to implement data security, including access controls and auditing. Different databases and storage systems all have their own method of restricting access, and they are not all compatible with each other. In order to simplify the process of securing your data in the cloud, Manav Mital created Cyral to provide a way of enforcing security as code. In this episode he explains how the system is architected, how it can help you enforce compliance, and what is involved in getting it integrated with your existing systems. This was a good conversation about an aspect of data management that is too often left as an afterthought. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today! |
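The "security as code" approach described above is easiest to see in miniature: access rules live in version control as declarative data, and every request is checked at a single enforcement point, whatever datastore sits behind it. The sketch below is a hypothetical illustration of that idea, not Cyral's actual API; all names in it are invented.

```python
# A minimal sketch of "security as code": policies are declared as data,
# version-controlled alongside the platform, and evaluated uniformly
# regardless of which database backs the dataset. Hypothetical names only;
# this is not Cyral's real interface.
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    role: str           # who the rule applies to
    dataset: str        # logical dataset, independent of the backing store
    actions: frozenset  # operations the role may perform, e.g. {"read"}


POLICIES = [
    Policy(role="analyst", dataset="orders", actions=frozenset({"read"})),
    Policy(role="pipeline", dataset="orders", actions=frozenset({"read", "write"})),
]


def is_allowed(role: str, dataset: str, action: str) -> bool:
    """Return True if any declared policy grants `action` on `dataset`."""
    return any(
        p.role == role and p.dataset == dataset and action in p.actions
        for p in POLICIES
    )


# Every request passes through one auditable enforcement point:
assert is_allowed("analyst", "orders", "read")
assert not is_allowed("analyst", "orders", "write")
```

Because the policies are plain data, the same review, diff, and audit workflow used for application code applies to access control as well.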
|
|
Better Data Quality Through Observability With Monte Carlo
2020-10-19 · 23:00
Barr Moses – CEO and co-founder @ Monte Carlo
Lior Gavish – co-founder @ Monte Carlo
Tobias Macey – host
Summary
In order for analytics and machine learning projects to be useful, they require a high degree of data quality. To ensure that your pipelines are healthy you need a way to make them observable. In this episode Barr Moses and Lior Gavish, co-founders of Monte Carlo, share the leading causes of what they refer to as data downtime and how it manifests. They also discuss methods for gaining visibility into the flow of data through your infrastructure, how to diagnose and prevent potential problems, and what they are building at Monte Carlo to help you maintain your data’s uptime (a toy illustration of such checks follows this entry).
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management.
- What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing, and de-identification features eliminate the need for time-consuming manual processes, and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning-based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Barr Moses and Lior Gavish about observability for your data pipelines and how they are addressing it at Monte Carlo.
Interview
- Introduction
- How did you get involved in the area of data management? |
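To make the idea of data downtime concrete, the sketch below hand-rolls the two simplest observability signals the episode touches on: freshness (has new data landed recently?) and volume (did today's row count deviate sharply from recent history?). The function names and thresholds are assumptions made for illustration; Monte Carlo's production checks are considerably more sophisticated.

```python
# Toy data-downtime detectors: a freshness check and a volume anomaly check.
# Thresholds and names are invented for the example.
import statistics
from datetime import datetime, timedelta, timezone


def freshness_alert(last_loaded_at: datetime, max_lag: timedelta) -> bool:
    """Flag 'data downtime' when no new data has landed within max_lag."""
    return datetime.now(timezone.utc) - last_loaded_at > max_lag


def volume_alert(daily_row_counts: list[int], today: int,
                 z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it sits more than z_threshold standard
    deviations from the recent mean."""
    mean = statistics.mean(daily_row_counts)
    stdev = statistics.stdev(daily_row_counts)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold


history = [10_120, 9_980, 10_240, 10_060, 10_190]
print(volume_alert(history, today=4_200))  # True: a sudden drop worth paging on
```

In practice these signals would be computed from warehouse metadata (load timestamps, table row counts) rather than hard-coded lists, but the alerting logic has the same shape.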
|
|
Self Service Real Time Data Integration Without The Headaches With Meroxa
2020-10-05 · 23:00
Summary
Analytical workloads require a well-engineered and well-maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data warehouses is a non-trivial effort, requiring a substantial investment of time and energy. Meroxa is a new platform that aims to automate the heavy lifting of change data capture, monitoring, and data loading (a minimal CDC example follows this entry). In this episode founders DeVaris Brown and Ali Hamidi explain how their tenure at Heroku informed their approach to making data integration self-service, how the platform is architected, and how they have designed their system to adapt to the continued evolution of the data ecosystem.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management.
- What are the pieces of advice that you wish you had received early in your data engineering career? If you could hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing, and de-identification features eliminate the need for time-consuming manual processes, and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning-based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing DeVaris Brown and Ali Hamidi about Meroxa, a new platform as a service for data. |
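For a concrete picture of the change data capture workload that platforms like Meroxa automate, the sketch below consumes Debezium-style change events (the common op/before/after envelope) and applies them to a downstream copy. The payloads and the target table are invented for the example; this is not Meroxa's implementation.

```python
# Change data capture in miniature: apply Debezium-style change events
# (op = c/u/d/r for create, update, delete, snapshot read) to keep a
# downstream copy in sync. Event payloads are made up for the example.
import json

target: dict[int, dict] = {}  # stand-in for a warehouse table keyed by id


def apply_change(event: dict) -> None:
    """Apply a single insert/update/delete event to the target table."""
    op = event["op"]
    if op in ("c", "u", "r"):          # create, update, snapshot read
        row = event["after"]
        target[row["id"]] = row
    elif op == "d":                    # delete
        target.pop(event["before"]["id"], None)


raw_events = [
    '{"op": "c", "before": null, "after": {"id": 1, "email": "a@example.com"}}',
    '{"op": "u", "before": {"id": 1}, "after": {"id": 1, "email": "b@example.com"}}',
    '{"op": "d", "before": {"id": 1}, "after": null}',
]
for raw in raw_events:
    apply_change(json.loads(raw))

print(target)  # {} -- the insert and update were superseded by the delete
```

A managed CDC platform adds the hard parts around this loop: reading the source database's replication log, ordering and retry guarantees, schema evolution, and monitoring.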
|
|
Vadim Semenov – Data Engineer @ DataDog
Tobias Macey – host
Summary
DataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to support their customers, they need to capture, process, and analyze massive amounts of timeseries data with a high degree of uptime and reliability. Vadim Semenov works on their data engineering team and joins the podcast in this episode to discuss the challenges that he works through, the systems that DataDog has built to power their business, and how their teams are organized to allow for rapid growth and massive scale. Getting an inside look at the companies behind the services we use is always useful, and this conversation was no exception.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Vadim Semenov about how data engineers work at DataDog.
Interview
- Introduction
- How did you get involved in the area of data management?
- For anyone who isn’t familiar with DataDog, can you start by describing the types and volumes of data that you’re dealing with?
- What are the main components of your platform for managing that information?
- How are the data teams at DataDog organized, and what are your primary responsibilities in the organization?
- What are some of the complexities and challenges that you face in your work as a result of the volume of data that you are processing?
- What are some of the strategies which have proven to be most useful in overcoming those challenges?
- Who are the main consumers of your work, and how do you build in feedback cycles to ensure that their needs are being met?
- Given that the majority of the data being ingested by DataDog is timeseries, what are your lifecycle and retention policies for that information?
- Most of the data that you are working with is customer generated from your deployed agents and API integrations. How do you manage cleanliness and schema enforcement for the events as they are being delivered? (See the sketch after this entry.)
- What are some of the projects that you have planned for the upcoming months and years?
- What are some of the technologies, patterns, or practices that you are hoping to adopt?
Contact Info
- LinkedIn
- @databuryat on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.
Links
- DataDog
- Hadoop
- Hive
- Yarn
- Chef
- SRE == Site Reliability Engineer
- Application Performance Management (APM)
- Apache Kafka
- RocksDB
- Cassandra
- Apache Parquet data serialization format
- SLA == Service Level Agreement
- WatchDog
- Apache Spark (Podcast Episode)
- Apache Pig
- Databricks
- JVM == Java Virtual Machine
- Kubernetes
- SSIS (SQL Server Integration Services)
- Pentaho
- JasperSoft
- Apache Airflow (Podcast.__init__ Episode)
- Apache NiFi (Podcast Episode)
- Luigi
- Dagster (Podcast Episode)
- Prefect
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast |
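The schema-enforcement question above has a classic answer that is worth a sketch: validate and normalize customer-submitted metric points at the ingest boundary, and divert anything malformed to a dead-letter path instead of the timeseries store. The field names and rules below are hypothetical, not Datadog's actual intake schema.

```python
# Sketch of schema enforcement at ingest: check required fields and types,
# normalize what passes, reject what cannot be salvaged. Field names and
# normalization rules are invented for illustration.
import time

REQUIRED_FIELDS = {"metric": str, "value": (int, float), "timestamp": (int, float)}


def normalize(event: dict) -> dict | None:
    """Return a cleaned event, or None if it should go to a dead-letter queue."""
    for field, typ in REQUIRED_FIELDS.items():
        if field not in event or not isinstance(event[field], typ):
            return None                          # malformed: reject at the boundary
    return {
        "metric": event["metric"].strip().lower(),   # canonical metric name
        "value": float(event["value"]),
        "timestamp": float(event["timestamp"]),
        "tags": sorted(event.get("tags", [])),       # canonical tag order
    }


good = {"metric": "App.Requests ", "value": 3,
        "timestamp": time.time(), "tags": ["env:prod"]}
bad = {"metric": "app.requests", "value": "three", "timestamp": time.time()}

print(normalize(good))  # cleaned: lower-cased name, float value, sorted tags
print(normalize(bad))   # None: value has the wrong type
```

Enforcing a canonical shape this early keeps every downstream consumer, from aggregation jobs to dashboards, free of per-event defensive code.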
|