talk-data.com

Topic: CSV (Comma-Separated Values)

Tags: tabular_data, text_based, human_readable

53 activities tagged

Activity Trend

Peak of 8 activities per quarter, 2020-Q1 through 2026-Q1

Activities

53 activities · Newest first

Summary

Data integration and routing is a constantly evolving problem, and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss the ways that NiFi can be used, how to start using it in your environment, and plans for future development. They also explain how it fits in the broad landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end-to-end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Kevin Doran and Andy LoPresto about Apache NiFi.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by explaining what NiFi is?

What is the motivation for building a GUI as the primary interface for the tool when the current trend is to represent everything as code?

How did you get involved with the project?

Where does it sit in the broader landscape of data tools?

Does the data that is processed by NiFi flow through the servers that it is running on (à la Spark/Flink/Kafka), or does it orchestrate actions on other systems (à la Airflow/Oozie)?

How do you manage versioning and backup of data flows, as well as promoting them between environments?

One of the advertised features is tracking provenance for data flows that are managed by NiFi. How is that data collected and managed?

What types of reporting are available across this information?

What are some of the use cases or requirements that lend themselves well to being solved by NiFi?

When is NiFi the wrong choice?

What is involved in deploying and scaling a NiFi installation?

What are some of the system/network parameters that should be considered?

What are the scaling limitations?

What have you found to be some of the most interesting, unexpected, and/or challenging aspects of building and maintaining the NiFi project and community?

What do you have planned for the future of NiFi?

Contact Info

Kevin Doran

@kevdoran on Twitter · Email

Andy LoPresto

@yolopey on Twitter · Email

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

NiFi, HortonWorks DataFlow, HortonWorks, Apache Software Foundation, Apple, CSV, XML, JSON, Perl, Python, Internet Scale, Asset Management, Documentum, DataFlow, NSA (National Security Agency), 24 (TV Show), Technology Transfer Program, Agile Software Development, Waterfall, Spark, Flink, Kafka, Oozie, Luigi, Airflow, FluentD, ETL (Extract, Transform, and Load), ESB (Enterprise Service Bus), MiNiFi, Java, C++, Provenance, Kubernetes, Apache Atlas, Data Governance, Kibana, K-Nearest Neighbors, DevOps, DSL (Domain Specific Language), NiFi Registry, Artifact Repository, Nexus, NiFi CLI, Maven Archetype, IoT, Docker, Backpressure, NiFi Wiki, TLS (Transport Layer Security), Mozilla TLS Observatory, NiFi Flow Design System, Data Lineage, GDPR (General Data Protection Regulation)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

An Introduction to SAS University Edition

SAS® OnDemand for Academics is now the primary software choice for learners. It provides free access to SAS for individual learners as well as university educators and students. Access to SAS University Edition will end Aug. 2, 2021; users will no longer be able to download it after Apr. 30, 2021. Get up and running with the SAS University Edition using Ron Cody’s easy-to-follow, step-by-step guide. Aimed at beginners who have downloaded the free SAS University Edition and want to use the point-and-click interactive environment of SAS Studio, write their own SAS programs, or both, An Introduction to SAS University Edition begins by showing you how to obtain the SAS University Edition and how you can run SAS on a PC or Macintosh computer. The first part of the book shows you how to perform basic tasks, such as producing a report, summarizing data, producing charts and graphs, and using the SAS Studio built-in tasks. The first part also describes how you can perform basic statistical tests using the interactive point-and-click environment. The second part of the book shows you how to write your own SAS programs and how to use SAS procedures to perform a variety of tasks. This part of the book also explains how to read data from a variety of sources: text files, Excel workbooks, and CSV files. To get familiar with the SAS Studio environment, this book also shows you how to access dozens of interesting data sets that are included with the product.

Summary

With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.
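As a small, hedged illustration of the contrast the episode draws between text formats and binary columnar formats, the sketch below writes the same table as CSV and as Parquet and compares file sizes. It assumes pandas with the pyarrow engine installed, and the table contents are invented for the example.

```python
# Write one small table as CSV (text) and Parquet (binary, columnar)
# to contrast the two classes of serialization format discussed above.
# Assumes pandas with the pyarrow engine installed; the data is made up.
import os
import pandas as pd

df = pd.DataFrame({
    "id": range(100_000),
    "value": [x * 0.5 for x in range(100_000)],
})

df.to_csv("table.csv", index=False)
df.to_parquet("table.parquet")  # columnar, compressed, schema-carrying

print("csv bytes:    ", os.path.getsize("table.csv"))
print("parquet bytes:", os.path.getsize("table.parquet"))
```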

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers. This is your host Tobias Macey and today I’m interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems.

Interview

Introduction

How did you first get involved in the area of data management?

What are the main serialization formats used for data storage and analysis?

What are the tradeoffs that are offered by the different formats?

How have the different storage and analysis tools influenced the types of storage formats that are available?

You’ve each developed a new on-disk data format, Avro and Parquet respectively. What were your motivations for investing that time and effort?

Why is it important for data engineers to carefully consider the format in which they transfer their data between systems?

What are the switching costs involved in moving from one format to another after you have started using it in a production system?

What are some of the new or upcoming formats that you are each excited about?

How do you anticipate the evolving hardware, patterns, and tools for processing data to influence the types of storage formats that maintain or grow their popularity?

Contact Information

Doug:

cutting on GitHub · Blog · @cutting on Twitter

Julien

Email · @J_ on Twitter · Blog · julienledem on GitHub

Links

Apache Avro, Apache Parquet, Apache Arrow, Hadoop, Apache Pig, Xerox Parc, Excite, Nutch, Vertica, Dremel White Paper

Twitter Blog on Release of Parquet

CSV, XML, Hive, Impala, Presto, Spark SQL, Brotli, ZStandard, Apache Drill, Trevni, Apache Calcite

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Apache Spark 2.x for Java Developers

Delve into mastering big data processing with 'Apache Spark 2.x for Java Developers.' This book provides a practical guide to implementing Apache Spark using the Java APIs, offering a unique opportunity for Java developers to leverage Spark's powerful framework without transitioning to Scala.

What this book will help me do: Learn how to process data from formats like XML, JSON, and CSV using Spark Core. Implement real-time analytics using Spark Streaming and third-party tools like Kafka. Understand data querying with Spark SQL and master SQL schema processing. Apply machine learning techniques with Spark MLlib to real-world scenarios. Explore graph processing and analytics using Spark GraphX.

Author(s): Kumar and Gulati, experienced professionals in Java development and big data, bring their wealth of practical experience and passion for teaching to this book. With a clear and concise writing style, they aim to simplify Spark for Java developers, making big data approachable.

Who is it for? This book is perfect for Java developers who are eager to expand their skillset into big data processing with Apache Spark. Whether you are a seasoned Spark user or diving into big data concepts for the first time, this book meets you at your level. With practical examples and straightforward explanations, you can unlock the potential of Spark in real-world scenarios.
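The book's examples are in Java; as a rough, language-swapped sketch of the first step it covers (reading CSV data into Spark), here is a minimal PySpark version. The file name and column name are invented placeholders, not examples from the book.

```python
# Minimal PySpark sketch of reading a CSV with Spark, analogous to the
# Java APIs the book covers. "flights.csv" and "carrier" are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-example").getOrCreate()

# Read a CSV with a header row, letting Spark infer column types.
df = spark.read.csv("flights.csv", header=True, inferSchema=True)

df.printSchema()
df.groupBy("carrier").count().show()

spark.stop()
```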

Preparing Data for Analysis with JMP

Access and clean up data easily using JMP®! Data acquisition and preparation commonly consume approximately 75% of the effort and time of total data analysis. JMP provides many visual, intuitive, and even innovative data-preparation capabilities that enable you to make the most of your organization's data. Preparing Data for Analysis with JMP® is organized within a framework of statistical investigations and model-building and illustrates the new data-handling features in JMP, such as the Query Builder. Useful to students and programmers with little or no JMP experience, or those looking to learn the new data-management features and techniques, it uses a practical approach to getting started, with plenty of examples. Using step-by-step demonstrations and screenshots, this book walks you through the most commonly used data-management techniques and includes lots of tips on how to avoid common problems. With this book, you will learn how to:

Manage database operations using the JMP Query Builder

Get data into JMP from other formats, such as Excel, CSV, SAS, HTML, JSON, and the web

Identify and avoid problems with the help of JMP’s visual and automated data-exploration tools

Consolidate data from multiple sources with Query Builder for tables

Deal with common issues and repairs, including reshaping tables (stack/unstack), managing missing data with techniques such as imputation and Principal Components Analysis, cleaning and correcting dirty data, computing new variables, transforming variables for modelling, and reconciling time and date

Subset and filter your data

Save data tables for exchange with other platforms
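As a loose illustration of two of the data-prep tasks listed above, reshaping a table (stack/unstack) and imputing missing values, here is a small pandas sketch; it is not JMP, and the data and column names are made up for the example.

```python
# Illustrative pandas sketch of two data-prep tasks the book covers in JMP:
# reshaping a table (stack/unstack) and imputing missing values.
# The data and column names here are made up for the example.
import numpy as np
import pandas as pd

wide = pd.DataFrame({
    "subject": ["A", "B"],
    "2023": [10.0, np.nan],
    "2024": [12.0, 14.0],
})

# "Stack": wide -> long, one row per (subject, year) measurement.
long = wide.melt(id_vars="subject", var_name="year", value_name="score")

# Simple mean imputation for the missing score.
long["score"] = long["score"].fillna(long["score"].mean())

# "Unstack": long -> wide again.
back = long.pivot(index="subject", columns="year", values="score")
print(back)
```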

Learning Pentaho CTools

Learning Pentaho CTools is a comprehensive guide to building sophisticated and custom analytics dashboards using the powerful capabilities of Pentaho CTools. This book walks you through the process of creating interactive dashboards, integrating data sources, and applying data visualization best practices. You'll quickly gain the expertise needed to create impactful dashboards with ease.

What this book will help me do: Master installing and configuring CTools for Pentaho to jumpstart dashboard development. Harness diverse data sources and deliver data in formats like CSV, JSON, and XML for customized analytics. Design and implement dynamic, visually stunning dashboards using the Community Dashboard Framework (CDF). Deploy and integrate plugins, leverage widgets, and manage dashboards effectively with version control. Enhance interactivity by customizing dashboard components, charts, and filters to suit unique requirements.

Author(s): Gaspar, an expert in Pentaho and its tools, has been a Senior Consultant at Pentaho, where he gained in-depth experience crafting analytics solutions. He brings to this book his teaching passion and field expertise, combining theoretical insights with practical applications. His approachable style ensures readers can follow technical concepts effectively.

Who is it for? This book is ideal for developers who are looking to enhance their understanding of Pentaho's CTools portfolio to build advanced dashboards. A working knowledge of JavaScript and CSS will enable readers to get the most out of this guide. Whether you aim to extend your analytics capabilities or learn the tools from scratch, this book bridges the gap between learning and application.

This episode is a discussion of multiple regression: the use of observations that are a vector of values to predict a response variable. For this episode, we consider how features of a home such as the number of bedrooms, number of bathrooms, and square footage can predict the sale price. Unlike a typical episode of Data Skeptic, these show notes are not just supporting material, but are actually featured in the episode.

The site Redfin graciously allows users to download a CSV of results they are viewing. Unfortunately, they limit this extract to 500 listings, but you can still use it to try the same approach on your own using the site's download link.
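As a minimal sketch of the approach described in this episode, the following fits a multiple regression with scikit-learn; the file name and column names are hypothetical stand-ins for the exported listing data.

```python
# Minimal multiple-regression sketch with scikit-learn.
# "redfin.csv" and its column names are hypothetical stand-ins
# for the exported listing data described above.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("redfin.csv")

# Feature vector per home: bedrooms, bathrooms, square footage.
X = df[["beds", "baths", "sqft"]]
y = df["price"]

model = LinearRegression().fit(X, y)
print("coefficients:", dict(zip(X.columns, model.coef_)))
print("intercept:", model.intercept_)
print("R^2:", model.score(X, y))
```
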
Data Science at the Command Line

This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data. To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools. Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.

Obtain data from websites, APIs, databases, and spreadsheets

Perform scrub operations on plain text, CSV, HTML/XML, and JSON

Explore data, compute descriptive statistics, and create visualizations

Manage your data science workflow using Drake

Create reusable tools from one-liners and existing Python or R code

Parallelize and distribute data-intensive pipelines using GNU Parallel

Model data with dimensionality reduction, clustering, regression, and classification algorithms
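The book's pipelines are built from command-line tools themselves; as a rough Python analogue of the obtain/scrub/explore steps it describes, consider this standard-library sketch (the file name and column name are placeholders).

```python
# Rough Python analogue of an obtain -> scrub -> explore pipeline,
# like those the book builds from command-line tools.
# "data.csv" and the "value" column are made-up placeholders.
import csv
import statistics

values = []
with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        cell = row.get("value", "").strip()   # scrub: trim whitespace
        if cell:                              # scrub: drop empty cells
            values.append(float(cell))

# explore: basic descriptive statistics
print("n =", len(values))
print("mean =", statistics.mean(values))
print("stdev =", statistics.stdev(values))
```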

PROC DOCUMENT by Example Using SAS

PROC DOCUMENT by Example Using SAS demonstrates the practical uses of the DOCUMENT procedure, a part of the Output Delivery System, in SAS 9.3. Michael Tuchman explains how to work with PROC DOCUMENT, which is designed to store your SAS procedure output for replay at a later time without having to rerun your original SAS code. You’ll learn how to:

save a collection of procedure output, descriptive text, and supporting graphs that can be replayed as a single unit

save output once and distribute that same output in a variety of ODS formats such as HTML, CSV, and PDF

create custom reports by comparing output from the same procedure run at different points in time

create a table of contents for your output

modify the appearance of both textual and graphical ODS output even if the original data is no longer available or easily accessible

manage your tabular and graphical output by using descriptive labels, titles, and footnotes

rearrange the original order of output in a procedure to suit your needs

After using this book, you’ll be able to quickly and easily create libraries of professional-looking output that are accessible at any time.

This book is part of the SAS Press program.

SAS Server Pages

SAS Server Pages have been used by SAS developers as a way of creating custom user interfaces for Web-based applications. This enhanced book offers information on how to create SAS Server Pages using the SAS 9.3 experimental procedure PROC STREAM, providing users with a foundation technology that greatly expands the capabilities of SAS for dynamic and rich content generation. By combining PROC STREAM and the Macro facility, SAS can now more easily generate any type of markup or text-based content such as HTML, XML, and CSV.

Exclusively available in electronic format, this book provides more extensive and flexible ways to develop applications using video examples of a wide range of PROC STREAM and SAS Server Pages techniques, including both Web applications and Base SAS implementations. Users can see results immediately and can access additional content and information online through embedded links. It also offers basic how-to documentation on PROC STREAM and an overview of a Portal Reporting Framework that illustrates creating custom user interfaces for stored processes within the SAS Portal.

Ideal for SAS programmers who have some knowledge of the Macro facility as well as BI users, SAS Server Pages: Generating Dynamic Content removes the difficulties associated with HTML-based content creation while providing a resource on using PROC STREAM in a dynamic, enhanced format.

SQL Server 2012 Data Integration Recipes: Solutions for Integration Services and Other ETL Tools

SQL Server 2012 Data Integration Recipes provides focused and practical solutions to real-world problems of data integration. Need to import data into SQL Server from an outside source? Need to export data and send it to another system? SQL Server 2012 Data Integration Recipes has your back. You'll find solutions for importing from Microsoft Office data stores such as Excel and Access, from text files such as CSV files, from XML, from other database brands such as Oracle and MySQL, and even from other SQL Server databases. You'll learn techniques for managing metadata, transforming data to meet the needs of the target system, handling exceptions and errors, and much more. What DBA or developer isn't faced with the need to move data back and forth? Author Adam Aspin brings 10 years of extensive ETL experience involving SQL Server, and especially satellite products such as Data Transformation Services and SQL Server Integration Services. Extensive coverage is given to Integration Services, Microsoft's flagship tool for data integration in SQL Server environments. Coverage is also given to the broader range of tools such as OPENDATASOURCE, linked servers, OPENROWSET, Migration Assistant for Access, BCP Import, and BULK INSERT, just to name a few. If you're looking for a resource to cover data integration and ETL across the gamut of Microsoft's SQL Server toolset, SQL Server 2012 Data Integration Recipes is the one book that will meet your needs.

Provides practical and proven solutions toward creating resilient ETL environments

Clearly answers the tough questions which professionals ask

Goes beyond the tools to a thorough discussion of the underlying techniques

Covers the gamut of data integration, beyond just SSIS

Includes example databases and files to allow readers to test the recipes

What you'll learn:

Import and export to and from CSV files, XML files, and other text-based sources.

Move data between SQL databases, including SQL Server and others such as Oracle Database and MySQL.

Discover and manage metadata held in various database systems.

Remove duplicates and consolidate from multiple sources.

Transform data to meet the needs of target systems.

Profile source data as part of the discovery process.

Log and manage errors and exceptions during an ETL process.

Improve efficiency by detecting and processing only changed data.

Who this book is for:

SQL Server 2012 Data Integration Recipes is written for developers wishing to find fast and reliable solutions for importing and exporting to and from SQL Server. The book appeals to DBAs as well, who are often tasked with implementing ETL processes. Developers and DBAs moving to SQL Server from other platforms will find the succinct, example-based approach ideal for quickly applying their general ETL knowledge to the specific tools provided as part of a SQL Server environment.

Using XML with Legacy Business Applications

"This volume offers relentlessly pragmatic solutions to help your business applications get the most out of XML, with a breezy style that makes the going easy. Mike has lived this stuff; he has a strong command of the solutions and the philosophy that underlies them." --Eve Maler, XML Standards Architect, Sun Microsystems Businesses running legacy applications that do not support XML can face a tough choice: Either keep their legacy applications or switch to newer, XML-enhanced applications. XML presents both challenges and opportunities for organizations as they struggle with their data. Does this dilemma sound familiar? What if you could enable a legacy application to support XML? You can. In e-commerce expert Michael C. Rawlins outlines usable techniques for solving day-to-day XML-related data exchange problems. Using an easy-to-understand cookbook approach, Rawlins shows you how to build XML support into legacy business applications using Java and C++. The techniques are illustrated by building converters for legacy formats. Converting CSV files, flat files, and X12 EDI to and from XML will never be easier! Using XML with Legacy Business Applications, Inside you'll find: A concise tutorial for learning to read W3C XML schemas An introduction to using XSLT to transform between different XML formats Simple, pragmatic advice on transporting XML documents securely over the Internet For developers working with either MSXML with Visual C++ or Java and Xerces: See Chapter 3 for a step-by-step guide to enabling existing business applications to export XML documents See Chapter 2 for a step-by-step guide to enabling existing business applications to import XML documents See Chapter 5 for code examples and tips for validating XML documents against schemas See Chapter 12 for general tips on building commerce support into an application For end users who need a simple and robust conversion utility: See Chapter 7 for converting CSV files to and from XML See Chapter 8 for converting flat files to and from XML See Chapter 9 for converting X12 EDI to and from XML See Chapter 11 for tips on how to use these techniques together for complex format conversions The resource-filled companion Web site (www.rawlinsecconsulting.com/booksupplement) includes executable versions of the utilities described in the book, full source code in C++ and Java, XSLT stylesheets, bug fixes, sample input and output files, and more. 0321154940B07142003

This hands-on lab guides you through importing real-world data from CSV files into a Cloud SQL database. Using a flight dataset from the US Bureau of Transport Statistics, you'll gain hands-on experience with data ingestion and basic analysis. You'll learn to create a Cloud SQL instance and database, effectively import your data, and build a foundational data model using SQL queries.
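The lab itself targets Cloud SQL; as a self-contained illustration of the import-and-query step, here is a sketch using Python's built-in sqlite3 as a stand-in database. The file, table, and column names are hypothetical, not the lab's actual schema.

```python
# Illustrative CSV -> SQL import, using sqlite3 as a stand-in for Cloud SQL.
# File, table, and column names are hypothetical.
import csv
import sqlite3

conn = sqlite3.connect("flights.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS flights (
        carrier   TEXT,
        origin    TEXT,
        dest      TEXT,
        dep_delay REAL
    )
""")

with open("flights.csv", newline="") as f:
    rows = [(r["carrier"], r["origin"], r["dest"], float(r["dep_delay"]))
            for r in csv.DictReader(f)]

conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Basic analysis, as in the lab: average departure delay per carrier.
for carrier, delay in conn.execute(
        "SELECT carrier, AVG(dep_delay) FROM flights GROUP BY carrier"):
    print(carrier, round(delay, 2))
conn.close()
```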

If you register for a Learning Center lab, please ensure that you sign up for a Google Cloud Skills Boost account with both your work-domain and personal email addresses. You will need to authenticate your account as well (be sure to check your spam folder!). This will ensure you can access your labs quickly when you arrive onsite. You can follow this link to sign up!