Data engineering with Python : work with massive datasets to design data models and automate data pipelines using Python

Crickard, Paul

Data engineering with Python : work with massive datasets to design data models and automate data pipelines using Python - Birmingham - Mumbai Packt Publishing 2020 - xii, 337p.

Published by Packt Publishing Limited, Birmingham, UK. Title page: Birmingham—Mumbai.

Preface -- Section 1. Building data pipelines: extract, transform, and load -- Ch. 1. What is data engineering? -- Ch. 2. Building our data engineering infrastructure -- Ch. 3. Reading and writing files -- Ch. 4. Working with databases -- Ch. 5. Cleaning, transforming, and enriching data -- Ch. 6. Building a 311 data pipeline -- Section 2. Deploying data pipelines in production -- Ch. 7. Features of a production pipeline -- Ch. 8. Version control with the NiFi registry -- Ch. 9. Monitoring data pipelines -- Ch. 10. Building a production data pipeline -- Section 3. Beyond batch: real-time and streaming data -- Ch. 11. Building a custom NiFi processor -- Ch. 12. Streaming data with Apache NiFi -- Ch. 13. Streaming data with Apache Kafka -- Ch. 14. Data processing with Apache Spark -- Ch. 15. Real-time edge data with MiNiFi, Kafka, and Spark -- Appendix: building a NiFi cluster -- Index.

Practical guide to data engineering using Python and open-source Apache technologies for building, deploying, and managing data pipelines. Three sections: (1) Building ETL pipelines — reading/writing files, relational and NoSQL databases, data cleaning and transformation, Apache NiFi pipeline; (2) Production deployment — NiFi registry version control, monitoring, staging, validation, failure handling; (3) Real-time and streaming data — Apache NiFi streaming, Apache Kafka (Python producers and consumers), Apache Spark and PySpark processing, real-time edge data with MiNiFi, Kafka and Spark. Appendix: building a distributed NiFi cluster. Code files on GitHub. Suitable for data engineers, data analysts, ETL developers, and IT professionals transitioning to data-driven roles.

9781839214189 183921418X


Python (Computer program language)
Data mining.
Apache Kafka (Computer program)
Apache Spark (Electronic resource)
Real-time data processing.

005.133 CRI-D
