2 What’s Data Science and How Do I Do It?
Data Science is a multi-layered field in which the latest machine learning methods are only a tiny part. To finish your data science analysis, you’ll need to complete many steps – from collecting to manipulating to exploring the data. And eventually, you will need to communicate your findings somehow.
But first things first. To analyze the data, you must first obtain it. You need to know where to get it and how to load it into your tools. The data is rarely available in the form you need for further processing. Familiarizing yourself with the available information, cleaning it up, and converting it into formats that both humans and machines can read are essential steps that often make up a large part of the data scientist’s work.
Before you can analyze the obtained data, you must first select and master the right tool: the programming language. The most often used languages for Data Science are R, which was explicitly developed for statistics, and Python, which is characterized by its additional versatility. The data scientist does not have to be a perfect software developer who masters every detail and programming paradigm. Still, competent handling of a language’s syntax (writing code) and idiosyncrasies is essential for her.
There are some well-developed method collections, the so-called packages or libraries, which add a lot of functionality to a base programming language. As a data scientist, you should also learn and master the use of these collections, especially when preparing the data. Once you have prepared your data, you can finally analyze it.
It is also crucial to know and understand the multitude of statistical approaches so that you can choose the correct method for the problem at hand. The newest, best, and most beautiful neural network is not always the solution to everything! One step is still missing in the data science process: understanding and communicating your results. The results are often not immediately intuitive and are sometimes even surprising. Here, the data scientist employs specific expertise and creativity, especially during data visualization.
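The steps described above (collect, clean, explore, communicate) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set, the column names `city` and `price`, and all function names are hypothetical and not part of the project, and the sketch uses only the standard library, whereas a real analysis would typically rely on packages such as pandas.

```python
import csv
import io
import statistics

# Hypothetical raw data; in a real project this would come from a file,
# a database, or an API.
RAW_CSV = """city,price
Berlin,120
Hamburg,
Munich,200
Berlin,80
"""


def load_rows(text):
    """Step 1 (collect): parse the CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows):
    """Step 2 (clean): drop rows without a price and convert prices to floats."""
    return [
        {"city": row["city"], "price": float(row["price"])}
        for row in rows
        if row["price"].strip()
    ]


def mean_price_by_city(rows):
    """Step 3 (explore): compute the mean price per city."""
    prices = {}
    for row in rows:
        prices.setdefault(row["city"], []).append(row["price"])
    return {city: statistics.mean(values) for city, values in prices.items()}


def report(summary):
    """Step 4 (communicate): turn the summary into readable lines."""
    return [f"{city}: {mean:.2f}" for city, mean in sorted(summary.items())]


rows = load_rows(RAW_CSV)
cleaned = clean_rows(rows)
summary = mean_price_by_city(cleaned)
for line in report(summary):
    print(line)
```

Note that the row with the missing price is silently dropped here. Whether to drop, impute, or flag such rows is a modeling decision that depends on the data and the question.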
2.1 What’s R?
R is a programming language developed by statisticians in the early 90s to calculate and visualize statistical results. A lot has happened since then, and by now, R is one of the most widely used programming languages in Data Science. You don’t have to compile your R code; instead, you can run it interactively and dynamically. Such an approach makes it possible to quickly gain basic knowledge about existing data and display it graphically. R offers much more than just programming. The language provides a complete ecosystem for solving statistical problems. A large number of packages and interfaces are available, which you can use to expand the basic functionality of the programming language to, say, create a COVID-Tracker application.
2.1.1 RStudio Cloud
Before you can use R, you usually have to install some separate programs locally on your computer. Typically, you first install a “raw” version of R. In theory, you can then start programming. However, it is challenging to carry out an entire project with a “raw” version of R. That’s why there is RStudio, a free Integrated Development Environment (IDE) for R. Such an IDE includes many essential features that simplify programming with R, among them code auto-completion, a friendly user interface, and many extension options.
Think of R as your car’s engine.
And think of RStudio as your car’s dashboard that shows fancy metrics, has a radio and allows you to adjust air-conditioning!
Experience has shown that installing R and RStudio locally on your computer takes some effort.
Fortunately, RStudio also has a cloud solution that eliminates these steps: RStudio Cloud.
You can edit your project in the same IDE in the browser without any prior installations on your computer.
You can also easily switch your project from a private to a public project and give your team an insight into your code via a link or by giving them access to the workspace directly.
In this way, you can easily exchange ideas with your team.
We will introduce RStudio Cloud and unlock access to our workspace at our first Coding Meetup. Until then, focus on learning the “hard skills” of programming with the courses on DataCamp. That brings us to your curriculum in the next section!
2.1.2 Curriculum
The following list shows the required DataCamp courses for the Data Science with R Track at TechAcademy.
As a beginner, please stick to the courses of the “beginner” program.
Ambitious beginners can, of course, take the advanced courses afterward.
However, it would be best if you worked through the courses in the order we listed them.
The same applies to the advanced courses: finish the specified courses in the given order. If you have already mastered the topics of an advanced course and are convinced that it adds no value for you, feel free to replace it with one of the courses in the “Exchange Pool” (see list below). However, you should not start an exchange course until you have finished all chapters of the advanced course “Intermediate R.”
To receive the certificate, both beginners and advanced learners must complete at least two-thirds of the curriculum (6/9 courses).
For beginners, this means up to and including the course “Data Visualization with ggplot2 (Part 1),” and for advanced learners, up to and including “Supervised Learning in R: Classification.” In addition, you should complete at least two-thirds of the project tasks.
After completing the curriculum and the project’s (minimal) requirements, you will receive your TechAcademy certificate!
R Fundamentals (Beginner Track)
- Introduction to R (4h)
- Intermediate R (6h)
- Introduction to Importing Data in R (3h)
- Cleaning Data in R (4h)
- Data Manipulation with dplyr (4h)
- Data Visualization with ggplot2 (Part 1) (5h)
- Exploratory Data Analysis in R (4h)
- Correlation and Regression in R (4h)
- Multiple and Logistic Regression in R (4h)
Machine Learning Fundamentals in R (Advanced Track)
- Intermediate R (6h)
- Introduction to Importing Data in R (3h)
- Cleaning Data in R (4h)
- Importing & Cleaning Data in R: Case Studies (4h)
- Data Visualization with ggplot2 (Part 1) (5h)
- Supervised Learning in R: Classification (4h)
- Supervised Learning in R: Regression (4h)
- Unsupervised Learning in R (4h)
- Machine Learning with caret in R (4h)
Data Science R (Advanced Track) – Exchange Pool
2.1.3 Helpful Links
- RStudio Cheat Sheets
- RMarkdown Explanation (to document your analyses)
- StackOverflow (forum for all kinds of coding questions)
- CrossValidated (Statistics and Data Science forum)
2.2 What’s Python?
Python is a dynamic programming language. You can execute the code in the interpreter, so you do not have to compile the code first. This feature makes Python very easy and quick to use. Excellent usability, easy readability, and simple structuring were and still are core ideas in developing this programming language. You can use Python to program according to many paradigms; structured and object-oriented programming are the most straightforward due to the structure of the language. Still, functional or aspect-oriented programming is also possible. These options give users significant freedom to design projects the way they want, but also plenty of room to write code that is confusing and difficult to understand. For this reason, programmers have developed specific standards over the decades based on the so-called Python Enhancement Proposals (PEP).
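To make the different paradigms concrete, here is a minimal sketch that solves the same small task (summing the squares of a list of numbers) in a structured, an object-oriented, and a functional style. The task and all names are hypothetical illustrations. The naming follows the conventions of PEP 8, the style guide among the PEPs.

```python
from functools import reduce

numbers = [1, 2, 3, 4]


# Structured (procedural) style: an explicit loop and a running total.
def sum_of_squares_structured(values):
    total = 0
    for value in values:
        total += value * value
    return total


# Object-oriented style: data and behaviour live together in a class.
class SquareSummer:
    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(value * value for value in self.values)


# Functional style: combine map and reduce without mutating any state.
def sum_of_squares_functional(values):
    squares = map(lambda value: value * value, values)
    return reduce(lambda acc, square: acc + square, squares, 0)


print(sum_of_squares_structured(numbers))
print(SquareSummer(numbers).total())
print(sum_of_squares_functional(numbers))
```

All three variants compute the same result. Which one reads best depends on the size of the project and on what your team is used to.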
2.2.1 Anaconda and Jupyter
Before you can use Python, you must install it on the computer. Python is often preinstalled on Linux and other Unix-like systems (such as macOS), but frequently in an older version. Because Python 2, which is no longer supported, differs from Python 3 in several ways, we decided to work with version 3.6 or higher.
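If you are unsure which version is installed, a short check like the following can tell you. It is a minimal sketch: the helper name `is_supported` is hypothetical, and the minimum of 3.6 is taken from the requirement stated above.

```python
import sys

# Minimum (major, minor) version required for this track.
MIN_VERSION = (3, 6)


def is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


print("Running Python", sys.version.split()[0])
print("Meets the 3.6+ requirement:", is_supported())
```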
One of the easiest ways to get Python and most of the best-known programming libraries is to install Anaconda. The provider’s website offers detailed installation instructions for all operating systems.
With Anaconda installed, all you have to do is open the Anaconda Navigator, and you’re ready to go.
There are two ways to get started: Spyder or Jupyter.
Spyder is an integrated development environment (IDE) for Python and offers everything from syntax highlighting to debugging (links to tutorials below).
The other option is to use Jupyter or Jupyter notebooks. Jupyter is a web-based interface for executing commands. Its significant advantage is that you can quickly write short pieces of code and try them out interactively without writing an entire executable program. Now you can get started!
If you have not worked with Jupyter before, we recommend that you complete this DataCamp course first. There you will get to know many tips and tricks that will make your workflow with Jupyter much easier.
To make your work and, above all, the collaboration easier, we are working with the Google Colab platform, which contains a Jupyter environment with the necessary libraries. You can then import all the data required for the project via Google Drive. We will introduce this environment during our first Coding Meetup. Until then, focus on learning the “hard skills” of programming with your courses on DataCamp. That brings us to your curriculum in the next section!
2.2.2 Curriculum
The following list shows the required DataCamp courses for the Data Science with Python Track at TechAcademy.
As a beginner, please stick to the courses of the “beginner” program.
Ambitious beginners can, of course, take the advanced courses afterward.
However, it would be best if you worked through the courses in the order we listed them.
The same applies to the advanced courses: finish the specified courses in the given order. If you have already mastered the topics of an advanced course and are convinced that it adds no value for you, feel free to replace it with one of the courses in the “Exchange Pool” (see list below). However, you should not start an exchange course until you have finished all chapters of the advanced course “Intermediate Python.”
To receive the certificate, both beginners and advanced learners must complete at least two-thirds of the curriculum (6/9 courses). For beginners, this means up to and including the course “Joining Data with pandas (4h),” and for advanced learners, up to and including “Exploratory Data Analysis in Python (4h).” In addition, you should complete at least two-thirds of the project tasks. After completing the curriculum and the project’s (minimal) requirements, you will receive your TechAcademy certificate!
Python Fundamentals (Beginner Track)
- Introduction to Data Science in Python (4h)
- Intermediate Python (4h)
- Python Data Science Toolbox (Part 1) (3h)
- Introduction to Data Visualization with Matplotlib (4h)
- Data Manipulation with pandas (4h)
- Joining Data with pandas (4h)
- Exploratory Data Analysis in Python (4h)
- Introduction to DataCamp Projects (2h)
- Introduction to Linear Modeling in Python (4h)
Data Science with Python (Advanced Track)
- Intermediate Python (4h)
- Python Data Science Toolbox (Part 1) (3h)
- Python Data Science Toolbox (Part 2) (4h)
- Cleaning Data in Python (4h)
- Exploring the Bitcoin Cryptocurrency Market (3h)
- Exploratory Data Analysis in Python (4h)
- Introduction to Linear Modeling in Python (4h)
- Supervised Learning with Scikit-Learn (4h)
- Linear Classifiers in Python (4h)
Data Science with Python (Advanced Track) - Exchange Pool
- TV, Halftime Shows and the Big Game (4h)
- Interactive Data Visualization with Bokeh (4h)
- Time Series Analysis (4h)
- Machine Learning for Time Series Data in Python (4h)
- Advanced Deep Learning with Keras (4h)
- Data Visualization with Seaborn (4h)
- Web Scraping in Python (4h)
- Writing Efficient Python Code (4h)
- Unsupervised Learning in Python (4h)
- Writing Efficient Code with pandas (4h)
- Introduction to Deep Learning in Python (4h)
- ARIMA Models in Python (4h)
2.2.3 Helpful Links
Official Tutorials/Documentation:
Further Explanations:
2.3 Your Data Science Project
2.3.1 Coding Meetups and Requirements
Now that you have learned the theoretical foundation of Data Science in the DataCamp courses, you can put your skills into practice. We have put together a project for you based on real data sets. You can read about the details of this project in the following chapters of this project guide.
Of course, we will also describe the project and the tools that go with it. We will discuss everything you need to know during the first Coding Meetup, which will take place on November 24, 2021. After that, your work on the project will officially begin. You can find the exact project tasks together with further explanations and hints in the following chapters.
To receive your TechAcademy certificate, you must solve at least two-thirds of the “Exploratory Data Analysis” part of your Data Science project. We added the “Price Prediction – The Application of Statistical Models” part for the advanced participants. In addition, you should complete two-thirds (6/9 courses) of the respective curriculum on DataCamp, as mentioned. You can find more detailed information about the curriculum in the “Curriculum” section of the different programming languages above.