diff --git a/docs/units/_unit06/unit06-01_Intro.md b/docs/units/_unit06/unit06-01_Intro.md
deleted file mode 100644
index b23001f..0000000
--- a/docs/units/_unit06/unit06-01_Intro.md
+++ /dev/null
@@ -1,33 +0,0 @@
----
-title: Introduction
-header:
-  image: /assets/images/unit_images/u04/header.png
-  image_description: "index"
-  caption: "Photo by [Maksym Kaharlytskyi](https://unsplash.com/@qwitka?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) from [Unsplash](https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)"
----
-*You will learn how to access the indices.*
-
-
-In the previous unit, you got an idea of how to address some elements of an object. Let's take a closer look now.
-At the end of this unit you should know how to
-
-* find
-* address and
-* change
-
-specific data in arrays, lists, matrices and data.frames.
-
-## Comments?
-You can leave comments below if you have questions or remarks about any of the text or code in this unit.
-Please copy the corresponding line into your comment to make it easier to answer your question.
-
-
-
-
-
diff --git a/docs/units/_unit06/unit06-02_csv.md b/docs/units/_unit06/unit06-02_csv.md
deleted file mode 100644
index 2c5ec10..0000000
--- a/docs/units/_unit06/unit06-02_csv.md
+++ /dev/null
@@ -1,71 +0,0 @@
----
-title: "CSV"
-toc: TRUE
-toc_float: TRUE
-toc_label: In this example
-header:
-  image: /assets/images/unit_images/u06/header.png
-  image_description: "computer"
-  caption: "Photo by [Free-Photos](https://pixabay.com/photos/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=336373) [Pixabay](https://pixabay.com/de/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=336373)"
----
-
-> “Torture the data, and it will confess to anything.” – Ronald Coase
-
-
-## An introduction to comma-separated values
-
-### What is a .csv-file?
-A csv is a comma-separated values file, which allows data to be saved in a tabular format. CSVs look like a spreadsheet but with a .csv extension. CSV files can be used with most spreadsheet programs, such as Microsoft Excel or Google Sheets. They differ from other spreadsheet file types because they can only hold a single sheet per file and cannot store cell, column, or row formatting. Formulas cannot be saved in this format either. These files may sometimes be called Character Separated Values or Comma Delimited files. They mostly use the comma character to separate (or delimit) data, but sometimes use other characters, like semicolons.
-
-### Why are .csv files used?
-These files serve a number of different purposes. They are often used for exchanging data between different applications, for example databases. They help to export a high volume of data to a more concentrated database, for instance. They also serve other primary functions:
-
-* CSV files are plain-text files, making them easier for the website developer to create.
-* Since they're plain text, they're easier to import into a spreadsheet or another storage database, regardless of the software used.
-* They help to organize large amounts of data.
-
-The idea is that it is possible to export complex data from one application to a csv file, and then import the data in that csv file into another application.
-
-### How do I save .csv files?
-Saving csv files is relatively easy; you just need to know where to change the file type.
-
-In the "Save As" dialog, below the "File name" field, select the `Save as type` option and change it to `CSV (Comma delimited) (*.csv)`. This should be the same for both Apple and Microsoft operating systems.
-
-### The Structure of a .csv File
-A csv file has a fairly simple structure. It’s a list of data separated by commas or semicolons. For example, let’s say there are a few contacts in a contact manager, and they are exported as a csv file.
The result is a file containing text like this:
-
-```r
- ID;Name;Email;Phone_Number;Address
- 01;Bob Smith;bob@example.com;123-456-7890;123 Fake Street
- 02;Mike Jones;mike@example.com;098-765-4321;321 Fake Avenue
-```
-
-They can be more complicated than that, and can contain thousands of lines, more entries on each line, or long strings of text. Some CSV files may not even have the headers at the top, and some may use quotation marks to surround each bit of data, but that’s the basic format.
-
-That simplicity is a feature. CSV files are designed to be a way to easily export data and import it into other programs. The resulting data is human-readable and can be easily viewed with a text editor like Notepad or a spreadsheet program like Microsoft Excel.
-
-## Know your decimal separator!
-As mentioned in [Unit 2.5 - Types of Data](https://geomoer.github.io/moer-base-r/unit02/unit02-05_types_of_data.html){:target="_blank"}, it is essential to know the decimal separator.
-
-When you check `?read.table` you will probably find all the answers that you need.
-There are two issues with European csv files:
-
-* What does the `c` in csv stand for? For standard csv this is a `,`, for European csv this is a `;`. `sep=""` is the corresponding argument in `read.table`.
-* What is the character for the decimal point? For standard csv this is a `.`, for European csv this is a `,`. `dec=""` is the corresponding argument in `read.table`.
-
-To read standard csv use `read.csv`, to read European csv use `read.csv2`. These two functions are just wrappers to `read.table` that set the appropriate arguments.
-
-If your file does not follow either of these standards, set the arguments manually with `sep` and `dec`.
-
-
-{% include figure image_path="/assets/images/unit_images/u06/602_csv.PNG" caption="Structure of a csv file." %}
-
-
-Let's move on to the next chapter: Working with csv ...
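The roles of the column separator and the decimal separator can also be sketched in Python with the standard `csv` module. This is only an illustration and not part of the R workflow described above: the sample text and the names `raw`, `rows` and `values` are made up for this example.

```python
import csv
import io

# Made-up "European" csv text: ';' separates columns, ',' is the decimal mark.
raw = "name;value\nA;1,5\nB;2,25\n"

# delimiter=";" plays the role of sep = ";" in read.table.
rows = list(csv.DictReader(io.StringIO(raw), delimiter=";"))

# Swapping ',' for '.' before converting plays the role of dec = ",".
values = [float(row["value"].replace(",", ".")) for row in rows]
print(values)  # [1.5, 2.25]
```

If the wrong delimiter were used, each row would arrive as a single unsplit field, which is usually the first visible symptom of a separator mismatch.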
-
-
-
diff --git a/docs/units/_unit06/unit06-02_index.md b/docs/units/_unit06/unit06-02_index.md
deleted file mode 100644
index 8448da5..0000000
--- a/docs/units/_unit06/unit06-02_index.md
+++ /dev/null
@@ -1,328 +0,0 @@
----
-title: Indexing
-toc: TRUE
-toc_float: TRUE
-header:
-  image: /assets/images/unit_images/u04/header.png
-  image_description: "index"
-  caption: "Photo by [Maksym Kaharlytskyi](https://unsplash.com/@qwitka?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) from [Unsplash](https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)"
----
-
-
-
-> “Big data is at the foundation of all of the megatrends that are happening today, from social to mobile to the cloud to gaming.” – Chris Lynch
-
----
-
-There are multiple ways to access or replace values in vectors or other data structures. The most common approach is to use “indexing”. This is also referred to as “slicing”.
-
-Note that brackets `[]` are used for indexing, whereas parentheses `( )` are used to call a function.
-
-## Arrays
-Here are some examples that show how elements of arrays can be obtained by indexing.
-
-Consider vector `v`:
-```python
-# load NumPy library
-import numpy as np
-
-# Generating some data
-v = np.arange(10, 16)
-print(v)
-# Output:
-# [10 11 12 13 14 15]
-```
-Note that when working with ranges or slicing in Python, the ending value is never included. So even if we set a range of 10 to 16, the output values range from 10 to 15.
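The same half-open rule can be checked with plain Python. The sketch below is an added illustration that uses only built-ins; the names `values` and `first_three` are not part of the unit.

```python
# Half-open ranges: the start value is included, the stop value is not.
values = list(range(10, 16))
print(values)  # [10, 11, 12, 13, 14, 15]

# The same rule applies to slices: positions 0, 1 and 2 are returned.
first_three = values[0:3]
print(first_three)  # [10, 11, 12]

# A handy consequence: the number of elements is simply stop - start.
print(len(values) == 16 - 10)  # True
```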
-
-Now let's start indexing. Let's get the first element of the array and print it.
-```python
-# Get the first element of a vector
-print(v[0]) # Output: 10
-```
-As you can see, we used a 0 instead of a 1 to access the first value of the array. This is because Python's indexing starts at 0. So to access the fourth element we need to write a 3, like this:
-
-```python
-# Accessing the fourth element
-print(v[3]) # Output: 13
-```
-You can also access multiple elements at a time, either with a range or with a list inside a second pair of `[]` to choose elements which are not directly next to each other.
-```python
-print(v[1:3]) # Output: [11 12]
-
-print(v[[1, 3]]) # Output: [11 13]
-```
-As you can see, the last index of the range (3) is again not included in the output.
-
-You can also use an index to change values:
-
-```python
-v[0] = 11 # Change the first element from 10 to 11
-print(v) # Output: [11 11 12 13 14 15]
-
-v[[3,5]] = [0,5]
-print(v) # Output: [11 11 12  0 14  5]
-
-v[1:6] = -99
-print(v) # Output: [ 11 -99 -99 -99 -99 -99]
-```
-
-## Matrix
-Consider matrix `m`.
-```python
-import numpy as np
-
-# Creating the matrix
-m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
-
-# Printing the matrix
-print(m)
-
-# Output:
-# [[1 2 3]
-#  [4 5 6]
-#  [7 8 9]]
-```
-
-Like arrays, values of matrices can be accessed through indexing. There are different ways to do this, but it is generally easiest to use two numbers in a double index, the first for the row number(s) and the second for the column number(s).
-
-```python
-# One value in row 2, column 2
-print(m[1, 1])
-
-# Output:
-# 5
-
-# Another one in row 1, column 3
-print(m[0, 2])
-
-# Output:
-# 3
-```
-
-You can also get multiple values at once.
-
-```python
-# 2 rows and 2 columns
-print(m[0:2, 0:2])
-
-# Output:
-# [[1 2]
-#  [4 5]]
-
-# Entire row
-print(m[1, :])
-
-# Output:
-# [4 5 6]
-
-# Entire column
-print(m[:, 1])
-
-# Output:
-# [2 5 8]
-```
-
-You can also use slicing to get specific columns or rows:
-
-```python
-# Single column
-print(m[:, 1])
-
-# Output:
-# [2 5 8]
-
-# Two columns
-print(m[:, [0, 2]])
-
-# Output:
-# [[1 3]
-#  [4 6]
-#  [7 9]]
-```
-Setting values of a matrix is similar to how you would do that for arrays, except that you now need to deal with two dimensions.
-
-```python
-# One value
-m[0, 0] = 5
-print(m)
-
-# Output:
-# [[5 2 3]
-#  [4 5 6]
-#  [7 8 9]]
-
-# A row
-m[2, :] = 10
-print(m)
-
-# Output:
-# [[ 5  2  3]
-#  [ 4  5  6]
-#  [10 10 10]]
-```
-
-## List
-Indexing lists can be a bit confusing, as you can refer both to the elements of the list and to the elements of the data (perhaps a matrix) in one of the list elements.
-
-```python
-# Create a list containing another list, a list of strings, and a NumPy matrix
-import numpy as np
-e = [[1, 2, 3], ['a', 'b', 'c', 'd'], np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])]
-
-print(e)
-
-# Output:
-# [[1, 2, 3], ['a', 'b', 'c', 'd'], array([[1, 2, 3],
-#        [4, 5, 6],
-#        [7, 8, 9]])]
-```
-
-We can access data inside a list element by chaining brackets. The first pair of brackets selects the list element, the second pair indexes into that element.
-
-```python
-# Accessing an element of the list (returns a list)
-print(e[1])
-# Output: ['a', 'b', 'c', 'd']
-
-# Accessing the contents of the list element directly
-print(e[1][0])
-# Output: a
-
-# Accessing the contents of the matrix
-print(e[2][1,1])
-# Output: 5
-```
-
-## Data.frame
-Indexing a DataFrame in Python can be done similarly to indexing matrices and lists, using the pandas library.
-
-First, create a DataFrame from a NumPy array m.
-
-```python
-import numpy as np
-import pandas as pd
-
-# Create a DataFrame from the matrix
-m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
-d = pd.DataFrame(m, columns=['a', 'b', 'c'])
-print(type(d))
-# Output: <class 'pandas.core.frame.DataFrame'>
-
-print(d)
-# Output:
-#    a  b  c
-# 0  1  2  3
-# 1  4  5  6
-# 2  7  8  9
-```
-
-You can extract a column by column number.
-
-```python
-# Extract the second column by index
-print(d.iloc[:, 1])
-# Output:
-# 0    2
-# 1    5
-# 2    8
-# Name: b, dtype: int64
-```
-But you can also extract a column by name:
-
-```python
-# Extract the second column by name
-print(d['b'])
-# Output:
-# 0    2
-# 1    5
-# 2    8
-# Name: b, dtype: int64
-```
-You can also use the dot notation or the get method.
-
-```python
-# Using dot notation to get the column values
-print(d.b)
-# Output:
-# 0    2
-# 1    5
-# 2    8
-# Name: b, dtype: int64
-
-# Or using the .get method
-print(d.get('b'))
-# Output:
-# 0    2
-# 1    5
-# 2    8
-# Name: b, dtype: int64
-```
-
-
-By default, pandas will drop the DataFrame structure when selecting a single column, returning a Series. To prevent this and keep the DataFrame structure, select with a list of column names:
-
-```python
-# Select a column without dropping the DataFrame structure
-print(d[['b']])
-# Output:
-#    b
-# 0  2
-# 1  5
-# 2  8
-
-# Another way to avoid dropping is to use the loc method
-print(d.loc[:, ['b']])
-# Output:
-#    b
-# 0  2
-# 1  5
-# 2  8
-```
-
-Why should you care about this drop business? Well, in many cases functions expect a specific data type, such as a NumPy array or a DataFrame, and raise an error if they get something else. One common situation is that you think you are providing data of the right type, such as a DataFrame, but in fact you are providing a Series, because the structure was dropped.
-
-## Finding Indices, isin, and match
-Sometimes you do not have the indices you need, and so you need to find them. For example, what are the indices of the elements in a vector that have values above 15?
-
-```python
-import numpy as np
-
-# Create a numpy array
-x = np.arange(10, 21)
-i = np.where(x > 15)
-print(x)
-# Output: [10 11 12 13 14 15 16 17 18 19 20]
-
-print(i)
-# Output: (array([ 6,  7,  8,  9, 10]),)
-
-print(x[i])
-# Output: [16 17 18 19 20]
-```
-So here we looked up the indices of the values greater than 15. We can then use these indices to get the values which are greater than 15.
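To see what `np.where` computes here, the same lookup can be rebuilt in plain Python. This is only an illustrative sketch, assuming `x` holds the values 10 to 20 as above; the names `indices` and `selected` are made up for this example.

```python
# Plain-Python sketch of "find the indices, then use them".
x = list(range(10, 21))

# Collect every position whose value is greater than 15.
indices = [pos for pos, value in enumerate(x) if value > 15]
print(indices)  # [6, 7, 8, 9, 10]

# Use the positions to pull out the matching values.
selected = [x[pos] for pos in indices]
print(selected)  # [16, 17, 18, 19, 20]
```

The NumPy version does the same work without an explicit loop, which matters once the arrays get large.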
-You can also use a boolean array for indexing.
-
-```python
-# Create a boolean array
-b = x > 15
-
-print(b)
-# Output: [False False False False False False  True  True  True  True  True]
-
-print(x[b])
-# Output: [16 17 18 19 20]
-```
-
-The `np.isin` function allows you to check whether each element of an array is in another array. It returns `True` or `False` for each element of the first array. With `np.where` you can then get the indices of the elements which are `True`.
-
-```python
-j = np.array([7, 9, 11, 13])
-print(np.isin(j, x))
-# Output: [False False  True  True]
-
-print(np.where(np.isin(j, x)))
-# Output: (array([2, 3]),)
-```
-
-
-
diff --git a/docs/units/_unit06/unit06-03_data_import.md b/docs/units/_unit06/unit06-03_data_import.md
deleted file mode 100644
index 775d157..0000000
--- a/docs/units/_unit06/unit06-03_data_import.md
+++ /dev/null
@@ -1,172 +0,0 @@
----
-title: Importing Data
-toc: TRUE
-toc_float: TRUE
-header:
-  image: /assets/images/unit_images/u06/header.png
-  image_description: "computer"
-  caption: "Photo by [Free-Photos](https://pixabay.com/photos/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=336373) [Pixabay](https://pixabay.com/de/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=336373)"
----
-
-
-
-
-## Importing data via buttons
-
-The data import features can be accessed from the environment pane or from the tools menu. The importers are grouped into 3 categories: Text data, Excel data and statistical data. To access this feature, use the "Import Dataset" dropdown from the "Environment" pane:
-
-{% include figure image_path="/assets/images/unit_images/u06/602_data-import-via-envi.png" caption="Data import via environment.
©[Javier Luraschi](https://support.rstudio.com/hc/en-us/articles/218611977-Importing-Data-with-RStudio)" %}
-
-
-Or through the "File" menu, followed by the "Import Dataset" submenu:
-{% include figure image_path="/assets/images/unit_images/u06/602_data-import-via-file.png" caption="Data import via File menu. ©[Javier Luraschi](https://support.rstudio.com/hc/en-us/articles/218611977-Importing-Data-with-RStudio)" %}
-
-
-Importing `From Text (readr)` files allows you to import csv files and, in general, character-delimited files using the `readr` package. This text importer provides support to:
-
-* Import from the file system or a URL
-* Change column data types
-* Skip or include-only columns
-* Rename the data set
-* Skip the first n rows
-* Use the header row for column names
-* Trim spaces in names
-* Change the column delimiter
-* Select the encoding
-* Select quote, escape, comment and NA identifiers
-
-
-{% include figure image_path="/assets/images/unit_images/u06/602_data-import-rstudio-overview.gif" caption="Data import via environment. ©[Javier Luraschi](https://support.rstudio.com/hc/en-us/articles/218611977-Importing-Data-with-RStudio)" %}
-
-
-## Importing data via code
-Reading tabulated data into a data frame, or writing it out from one, is a common task in data analysis. Use the `read.table` or the `read.csv` functions for this.
-
-```r
-df <- read.csv("file.csv") # or read.csv2, depending on the column delimiter and decimal separator
-```
-
-```r
-df <- read.table("file.csv", header = FALSE, sep = "",
-                 dec = ".", skip = 0, ...)
-
-write.table(df, file = , sep = " ", dec = ".", ...)
-```
-
-## Reading data from csv files
-
-Reading csv files is done with the `read.table` function from R's utils package. The function returns a data frame which contains the information of the csv file (example taken from [here](https://www.regionalstatistik.de/genesis/online/){:target="_blank"}).
-
-
-```r
-df <- read.table(paste0(envrmt$path_data_csv, "/AI001_gebiet_flaeche.csv"),
-                 skip = 4, header = TRUE, sep = ";", dec = ",")
-```
-
-As you can see, the `read.table` function gets several arguments (which is common for many functions). The first one gives the filename including the path to the file.
-* `skip = 4` tells the function to skip the first four lines (which are plain text lines in this case and not tabulated values)
-* `header = TRUE` tells the function that the csv file has a header line, which `read.table` uses to name the columns of the returned data frame
-* `sep = ";"` defines the separator of the individual columns in the data frame
-* `dec = ","` defines the decimal separator used in the dataset
-
-A note on the sequence of the arguments: the sequence of the arguments does
-not matter as long as you name them explicitly. If you do not use the argument
-identifier, as is the case for the first argument (the filename) in the example,
-then the sequence matters. To get information on the default sequence and of
-course the general application of each R function, type `?` followed by the function name
-(e.g. `?read.table`) in an R console.
-
-After executing the `read.table` function above, the content of the csv file is
-stored in a two-dimensional data frame called `df`.
-
-A quick way to check if everything is fine is to display the first few lines of
-the data frame using the `head` function (without the 2, it prints 6 lines by default).
-
-```r
-head(df,2)
-```
-
-```
-## X X.1 X.2
-## 1 1996 DG Deutschland
-## 2 1996 01 Schleswig-Holstein
-## Anteil.Siedlungs..und.Verkehrsfläche.an.Gesamtfl.
-## 1 11,8
-## 2 10,8
-## Anteil.Erholungsfläche.an.Gesamtfläche
-## 1 0,7
-## 2 0,7
-## Anteil.Landwirtschaftsfläche.an.Gesamtfläche
-## 1 54,1
-## 2 73,0
-## Anteil.Waldfläche.an.Gesamtfläche
-## 1 29,4
-## 2 9,3
-```
-
-## Writing data to csv files
-Writing data is as easy as reading it.
Just use the `write.table` function.
-
-```r
-write.table(df, file = paste0(envrmt$path_data_tmp, "/new.csv"),
-            sep = ",", dec = ".")
-```
-As you can see, you can define any column or decimal separator.
-
-For more information, have a look at e.g. the respective [data importing](http://www.statmethods.net/input/importingdata.html) and [data exporting](http://www.statmethods.net/input/exportingdata.html) pages at Quick-R, consult the package
-documentation or search the web.
-
-
-## Alternative data I/O using RDS
-Writing into CSV files is a good choice for data exchange with the non-R world. If you want to re-use the information stored in a data.frame or any other variable in R, writing the actual R object to a file connection is beneficial, especially for more complex objects like model outputs or geo-datasets which do not represent the final information (because this would likely be shared with others using GeoTiff or other well-recognized formats). To save an R object to a file connection, use the `saveRDS` function; to read it, use the `readRDS` function. The file extension **.rds** is generally used for that format.
-
-If you stay within R for reading and writing R objects from and to data files, you could also use the serialization of readRDS and saveRDS.
-
-```r
-saveRDS(df, file = paste0(envrmt$path_data_tmp, "/new.rds"))
-
-# Read data to different variable
-df2 = readRDS(paste0(envrmt$path_data_tmp, "/new.rds"))
-```
-
-
-
-```r
-saveRDS(df, file = )
-
-df = readRDS()
-```
-
-See the respective help pages for more details.
-
-
-Reading tabulated data into a data frame, or writing it out from one, is a common task in data analysis. You could use the read.table function for this.
-```r
-df <- read.table(, header = FALSE, sep = "",
-                 dec = ".", skip = 0, ...)
-
-write.table(df, file = , sep = " ", dec = ".", ...)
-```
-
-Writing into CSV files is a good choice for data exchange with the non-R world.
If you want to re-use the information stored in a data.frame or *any* other variable in R, writing the actual R object to a file connection is beneficial, especially for more complex objects like model outputs or geo-datasets which do not represent the final information (because this would likely be shared with others using GeoTiff or other well-recognized formats). To save an R object to a file connection, use the saveRDS function; to read it, use the readRDS function. The file extension "rds" is generally used for that format.
-```r
-saveRDS(df, file = )
-
-df = readRDS()
-```
-See the respective help pages for more details.
-
-
-## Test what you've learned so far!
-
-
-
-
-
-
-
-
diff --git a/docs/units/_unit06/unit06-03_subsetting.md b/docs/units/_unit06/unit06-03_subsetting.md
deleted file mode 100644
index 4b81c3e..0000000
--- a/docs/units/_unit06/unit06-03_subsetting.md
+++ /dev/null
@@ -1,106 +0,0 @@
----
-title: Subsetting
-toc: TRUE
-toc_float: TRUE
-header:
-  image: /assets/images/unit_images/u04/header.png
-  image_description: "index"
-  caption: "Photo by [Maksym Kaharlytskyi](https://unsplash.com/@qwitka?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) from [Unsplash](https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)"
----
-
-
-
-> “Big data is at the foundation of all of the megatrends that are happening today, from social to mobile to the cloud to gaming.” – Chris Lynch
-
----
-
-When you are working with a large data set, you are often only interested in a small part of it for your analysis. So how do you sort out all the variables and observations and extract only the ones you need? Well, Python has several ways of doing this in a process called "subsetting".
-
-The most basic way of subsetting a pandas DataFrame by position is to use `.iloc` with square brackets, as in:
-
-```
-data.iloc[x, y]
-```
-`data` is the data frame we want to subset.
`x` consists of the rows we want returned, and `y` consists of the columns we want returned. Let’s pull some data from the dataset `mtcars` and see how this is done on a real data set.
-
-Now, let’s suppose we only need *mpg, hp* and *gear* to show the relationship between fuel consumption, horsepower and number of gears. However, we need the data separated by the kind of transmission (am): 0 = automatic, 1 = manual. Here’s the basic way to retrieve that data in Python:
-
-```python
-import pandas as pd
-
-# Load the mtcars dataset
-mtcars = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/mtcars.csv')
-
-# Extracting data
-manual1 = mtcars[mtcars['am'] == 1].iloc[:, [0, 3, 9]]
-automatic1 = mtcars[mtcars['am'] == 0].iloc[:, [0, 3, 9]]
-```
-
-With `mtcars[mtcars['am'] == 1]` we select all rows which have a 1 in the column am, which stands for a manual car. With `.iloc[:, [0, 3, 9]]` we then select the columns we are interested in.
-
-So we split the dataset (mtcars) according to the type of transmission (`mtcars['am'] == 1`) and **used only** the first (mpg), fourth (hp) and tenth (gear) columns.
-
-You can also specify columns to drop instead of selecting columns to keep.
-```python
-# Dropping data
-manual2 = mtcars[mtcars['am'] == 1].drop(columns=['cyl', 'disp', 'drat', 'wt', 'qsec', 'vs', 'carb'])
-automatic2 = mtcars[mtcars['am'] == 0].drop(columns=['cyl', 'disp', 'drat', 'wt', 'qsec', 'vs', 'carb'])
-```
-
----
-
-## Subsetting functionality
-The `query` method and column selection can be used to subset a DataFrame without knowing the row and column indices.
-The `query` method takes a single string argument that represents the condition for filtering the DataFrame. Additionally, you can specify the columns you want to keep by chaining a column selection with double square brackets.
-
-```python
-# The query method
-manual3 = mtcars.query('am == 1')[['mpg', 'hp', 'gear']]
-automatic3 = mtcars.query('am == 0')[['mpg', 'hp', 'gear']]
-```
-
-In the following example, we select by different values:
-
-First, all rows that have a value of horsepower (hp) of at least 100 and below 200.
-Second, all rows that have a value of horsepower greater than or equal to 150 AND are automatic.
-Third, all rows that have a value of horsepower greater than or equal to 250 OR have fewer than 4 gears.
-We keep the mpg, hp and gear columns for all three, plus the am column for the last two.
-
-```python
-# hp between 100 and 200
-hp = mtcars.query('100 <= hp < 200')[['mpg', 'hp', 'gear']]
-
-# hp >=150 AND automatic
-hp_am = mtcars.query('hp >= 150 & am == 0')[['mpg', 'hp', 'gear', 'am']]
-
-# hp >=250 OR <4 gears
-hp_gear = mtcars.query('hp >= 250 | gear < 4')[['mpg', 'hp', 'gear', 'am']]
-```
-
----
-
-## Random sample
-
-Generally speaking, whenever we introduce randomness in Python, we should set a random seed to make our code reproducible. The `np.random.seed()` function sets the starting number used to generate a sequence of random numbers. This ensures that you get the same result if you start with that same seed each time you run the same process. The seed is an arbitrary number. You can then use the `sample()` method from pandas to take a random sample of size n from a DataFrame either with or without replacement.
-
-```python
-import numpy as np
-# Setting the seed for reproducibility
-np.random.seed(1234)
-```
-
-Random sample of a data frame:
-```python
-# Take a random sample of 10 rows without replacement
-sampled_df = mtcars.sample(n=10, replace=False)
-
-print(sampled_df)
-```
-
-
-
-
-
diff --git a/docs/units/_unit06/unit06-04_batch_processing.md b/docs/units/_unit06/unit06-04_batch_processing.md
deleted file mode 100644
index b6bcc0a..0000000
--- a/docs/units/_unit06/unit06-04_batch_processing.md
+++ /dev/null
@@ -1,70 +0,0 @@
----
-title: "Batch processing"
-toc: TRUE
-toc_float: TRUE
-header:
-  image: /assets/images/unit_images/u06/header.png
-  image_description: "computer"
-  caption: "Photo by [Free-Photos](https://pixabay.com/photos/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=336373) [Pixabay](https://pixabay.com/de/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=336373)"
----
-
-
-
-Below is an example for batch processing of multiple files in a for-loop.
-Each file in the "input" directory is sequentially processed (in this example getting the file size)
-and the result is written into a corresponding file with the same file name in the "output" directory.
-
-Note that the management of file names and file paths is crucial for batch processing.
-So know the path to your working directory, the paths to your input and output folders, the paths to your files, the names of your files and their extensions.
-
-
-
-
-
-### Setting up your working environment
-
-```r
-# set working directory with input and output folders
-path_dir <- "~/data/test"
-setwd(path_dir)
-
-# get file names and path names
-file_names <- list.files(file.path(path_dir, "input")) # only file names
-file_paths <- list.files(file.path(path_dir, "input"), full.names = TRUE) # complete paths to files. Necessary to open files.
-```
-
-### Batch process your files
-
-
-```r
-# loop over all files. Note that files are identified by their file paths.
-for (file_path in file_paths){
-  # Some calculation for each input file
-  size <- file.info(file_path)$size # Get file size for each file referred to in file_paths
-  # Use the results and write out. One output file for one input file
-  write(size, file = file.path(path_dir, "output", paste0(basename(file_path), ".out")))
-  # Show status of each iteration
-  cat("Processing:", file_path, "\t", "result of operation: ", size, "\n")
-}
-```
-
-This is all it takes to perform even more complex operations on multiple input files.
-
-### Functions you should know
-
-```r
-setwd()
-file.path()
-list.files()
-basename()
-dirname()
-substr(basename(file_path), 1, nchar(basename(file_path))-4) # for removing the file extension, i.e. the last four characters.
-
-library("tools")
-file_ext(file_path) # get file extension
-file_path_sans_ext(file_path) # get file path without extension
-file_path_as_absolute(file_path) # get full file path
-
-# The regex way for getting the pure file name without path and extension
-sub(pattern = "(.*)\\..*$", replacement = "\\1", basename(file_path))
-```
diff --git a/docs/units/_unit06/unit06-04_summary.md b/docs/units/_unit06/unit06-04_summary.md
deleted file mode 100644
index 888581b..0000000
--- a/docs/units/_unit06/unit06-04_summary.md
+++ /dev/null
@@ -1,25 +0,0 @@
----
-title: "Summary"
-published: TRUE
-toc: TRUE
-toc_float: TRUE
-header:
-  image: /assets/images/unit_images/u04/header.png
-  image_description: "index"
-  caption: "Photo by [Maksym Kaharlytskyi](https://unsplash.com/@qwitka?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) from [Unsplash](https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)"
----
-
-## Test
-
-
-
-
-
-
-
-
-
diff --git a/docs/units/_unit06/unit06-05_summary.md b/docs/units/_unit06/unit06-05_summary.md
deleted file mode 100644
index f32ba3b..0000000
--- a/docs/units/_unit06/unit06-05_summary.md
+++ /dev/null
@@ -1,20 +0,0 @@
----
-title: 
"Summary"
-published: FALSE
-toc: TRUE
-toc_float: TRUE
-header:
-  image: /assets/images/unit_images/u06/header.png
-  image_description: "computer"
-  caption: "Photo by [Free-Photos](https://pixabay.com/photos/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=336373) [Pixabay](https://pixabay.com/de/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=336373)"
----
-
-
-
-## Test what you've learned so far!
-
-
-
-
-
-