
Apache Spark Logo Apache Flink Logo

Big Data Scripts (repository)

Summary
💾 A collection of Apache Spark scripts for getting familiar with the basics of batch processing of big data, and a collection of Apache Flink scripts for getting familiar with the basics of stream processing of big data.

Features

The Apache Spark scripts cover a range of topics such as:

- manipulating RDDs via:
  - functional programming principles like pattern matching
  - regex
  - functions like:
    - map
    - flatMap
    - reduceByKey
    - flatten
    - filter
- manipulating DataFrames via:
  - Spark SQL
  - custom aggregation functions using Window
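
As a rough illustration of the topics listed above (not taken from the scripts in this repository), the following sketch combines regex pattern matching, `flatMap`, `map`, `reduceByKey` and `filter` on an RDD, then runs a Spark SQL query and a windowed aggregation on a DataFrame. The object name `SparkBasicsSketch`, the helper `parse`, the sample log lines and the `sales` data are all hypothetical. The window part uses the built-in `avg` rather than a custom aggregation function, and it assumes Spark 2.1 or later running in local mode.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

object SparkBasicsSketch {
  // Regex with two capture groups: the log level and the message text.
  private val LogLine = """^(INFO|WARN|ERROR)\s+(.*)$""".r

  // Hypothetical helper: pattern matching on the regex extracts (level, message).
  def parse(line: String): Option[(String, String)] = line match {
    case LogLine(level, message) => Some((level, message))
    case _                       => None
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-basics-sketch")
      .master("local[*]") // assumption: local run, no cluster
      .getOrCreate()
    import spark.implicits._

    // Made-up input data for illustration only.
    val lines = spark.sparkContext.parallelize(Seq(
      "INFO job started",
      "WARN disk almost full",
      "ERROR disk full",
      "not a log line"
    ))

    val wordCounts = lines
      .flatMap(line => parse(line).toList)                            // drop lines that do not match
      .flatMap { case (_, message) => message.split("\\s+").toList } // one record per word
      .map(word => (word, 1))
      .reduceByKey(_ + _)                                             // sum the counts per word
      .filter { case (_, count) => count > 1 }                       // keep repeated words only
    wordCounts.collect().foreach(println)

    // Made-up DataFrame for the Spark SQL and Window parts.
    val sales = Seq(
      ("books", "2024-01", 10.0),
      ("books", "2024-02", 30.0),
      ("games", "2024-01", 5.0),
      ("games", "2024-02", 15.0)
    ).toDF("category", "month", "revenue")

    // Spark SQL on a temporary view.
    sales.createOrReplaceTempView("sales")
    spark.sql("SELECT category, SUM(revenue) AS total FROM sales GROUP BY category").show()

    // Window: per category, average over the previous and the current month.
    val lastTwoMonths = Window
      .partitionBy("category")
      .orderBy("month")
      .rowsBetween(-1, Window.currentRow)
    sales.withColumn("moving_avg", avg(col("revenue")).over(lastTwoMonths)).show()

    spark.stop()
  }
}
```

Keeping `parse` outside `main` makes the pattern-matching step easy to try out in a REPL without starting a Spark session.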

The Apache Flink scripts cover a range of topics such as:

- basic manipulation of DataStreams via functions like:
  - map
  - filter
  - flatMap
- working with stateful streams via keyBy
- dealing with infinite streams via:
  - different kinds of window assigners like TumblingEventTimeWindows or SlidingEventTimeWindows
  - keyed and non-keyed windows
  - new ProcessWindowFunction
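
The stream-processing topics above could be sketched roughly as follows. This is not code from the repository: the `Reading` case class, the `parse` helper, the `AveragePerWindow` class and the sample sensor data are hypothetical, and the sketch assumes the Flink Scala DataStream API from a 1.x release (1.11 or later, where `WatermarkStrategy` is available). It parses lines with `flatMap`, applies `filter` and `map`, keys the stream with `keyBy`, assigns a `TumblingEventTimeWindows` window of 10 seconds and computes a per-window average in a `ProcessWindowFunction`.

```scala
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import scala.util.Try

// Hypothetical event type: a sensor id, an event timestamp in ms and a measured value.
case class Reading(sensor: String, timestampMs: Long, value: Double)

// Emits (sensor, window end, average value) once per key and window.
class AveragePerWindow
    extends ProcessWindowFunction[Reading, (String, Long, Double), String, TimeWindow] {
  override def process(
      key: String,
      context: Context,
      elements: Iterable[Reading],
      out: Collector[(String, Long, Double)]): Unit = {
    val values = elements.map(_.value)
    out.collect((key, context.window.getEnd, values.sum / values.size))
  }
}

object FlinkBasicsSketch {
  // Hypothetical helper: parses "sensor,timestampMs,value"; malformed lines yield None.
  def parse(line: String): Option[Reading] = line.split(",") match {
    case Array(sensor, ts, value) =>
      Try(Reading(sensor.trim, ts.trim.toLong, value.trim.toDouble)).toOption
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Made-up bounded input; a real job would read from an unbounded source.
    val lines = env.fromElements(
      "s1,1000,20.5", "s2,2000,-1.0", "s1,4000,21.5", "garbage", "s1,12000,19.0")

    // Use the timestamp carried by each reading as its event time.
    val watermarks = WatermarkStrategy
      .forMonotonousTimestamps[Reading]()
      .withTimestampAssigner(new SerializableTimestampAssigner[Reading] {
        override def extractTimestamp(r: Reading, recordTimestamp: Long): Long = r.timestampMs
      })

    lines
      .flatMap(line => parse(line).toList)             // drop malformed lines
      .filter(_.value >= 0.0)                          // drop negative readings
      .map(r => r.copy(value = r.value * 1.8 + 32.0))  // Celsius to Fahrenheit
      .assignTimestampsAndWatermarks(watermarks)
      .keyBy(_.sensor)                                 // keyed stream: one window per sensor
      .window(TumblingEventTimeWindows.of(Time.seconds(10)))
      .process(new AveragePerWindow)
      .print()

    env.execute("flink-basics-sketch")
  }
}
```

Replacing `TumblingEventTimeWindows.of(Time.seconds(10))` with `SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5))` would produce overlapping windows instead. A non-keyed variant would skip `keyBy` and call `windowAll`, which then needs a `ProcessAllWindowFunction` rather than the keyed function shown here.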

Tools

Purpose	Name
Programming language	Scala
Cluster computing framework	Apache Spark, Apache Flink

Installation Process

It is assumed that both a Java JDK and an IDE such as IntelliJ are installed and that the user's operating system is Windows.

Install the Scala support plugin for your IDE.
Import the corresponding subfolder of this repository as a Maven project and resolve all dependencies.

Licence

These Big Data scripts are published under the MIT licence, which can be found in the LICENSE file. For this repository, the terms laid out there shall not apply to any individual who is currently enrolled as a student at a higher education institution. Those individuals shall not interact with any part of this repository other than this README in any way, for example by cloning it or looking at its source code, nor shall they have someone else interact with this repository in any way.

References

The Apache Spark logo was taken from Wikipedia and the Apache Flink logo from .