Skip to content

kristbaum/unstruwwel-py

Repository files navigation

unstruwwel

Overview

This R package provides means to detect and parse historic dates, e.g., to ISO 8601:2-2019. It automatically converts language-specific verbal information, e.g., “circa 1st half of the 19th century,” into its standardized numerical counterparts, e.g., “1801-01-01~/1850-12-31~.” The package follows the recommendations of the MIDAS (Marburger Informations-, Dokumentations- und Administrations-System), see, e.g., https://doi.org/10.11588/artdok.00003770. It internally uses lubridate. The name of the package is inspired by Heinrich Hoffmann’s rhymed story “Struwwelpeter”, which goes as follows:

Just look at him! there he stands, with his nasty hair and hands. See! his nails are never cut; they are grimed as black as soot; and the sloven, I declare, never once has combed his hair; anything to me is sweeter than to see Shock-headed Peter.

For the German-language original text, see the online digital library Wikisource.

Installation

You can install the released version of unstruwwel from CRAN with:

install.packages("unstruwwel")

To install the development version from GitHub use:

# install.packages("devtools")
devtools::install_github("stefanieschneider/unstruwwel")

Usage

The unstruwwel package contains only one function, unstruwwel(), that does all the magic language-specific standardization. unstruwwel() returns a named list, where each element is the result of applying the function to the corresponding element in the input vector.

English-language examples

dates <- c(
  "5th century b.c.", "unknown", "late 16th century", "mid-12th century",
  "mid-1880s", "June 1963", "August 11, 1958", "ca. 1920", "before 1856"
)

# returns valid ISO 8601:2-2019 dates
unlist(unstruwwel(dates, "en", scheme = "iso-format"), use.names = FALSE)
#> [1] "-0500-12-31/-0401-01-01" NA                        "1586-01-01/1600-12-31"  
#> [4] "1146-01-01/1155-12-31"   "1884-01-01/1885-12-31"   "1963-06-01/1963-06-30"  
#> [7] "1958-08-11/1958-08-11"   "1920-01-01~/1920-12-31~" "..1855-12-31"

# returns a numerical interval of length 2 
unstruwwel(dates, language = "en", scheme = "time-span") %>%
  tibble::as_tibble() %>% dplyr::mutate(id = dplyr::row_number()) %>% 
  tidyr::gather(key = id) %>% tidyr::unnest_wider(value) %>% 
  dplyr::rename_all(dplyr::funs(c("text", "start", "end")))
#> # A tibble: 9 × 3
#>   text              start   end
#>   <chr>             <dbl> <dbl>
#> 1 5th century b.c.   -500  -401
#> 2 unknown              NA    NA
#> 3 late 16th century  1586  1600
#> 4 mid-12th century   1146  1155
#> 5 mid-1880s          1884  1885
#> 6 June 1963          1963  1963
#> 7 August 11, 1958    1958  1958
#> 8 ca. 1920           1920  1920
#> 9 before 1856        -Inf  1855

German-language examples

dates <- c(
  "letztes Drittel 15. und 1. Hälfte 16. Jahrhundert", "undatiert", "1460?",
  "wohl nach 1923", "spätestens 1750er Jahre", "1897 (Guss vmtl. vor 1906)"
)

# returns valid ISO 8601:2-2019 dates
unlist(unstruwwel(dates, "de", scheme = "iso-format"), use.names = FALSE)
#> [1] "1467-01-01/1550-12-31"   NA                        "1460-01-01~/1460-12-31~"
#> [4] "1924-01-01?.."           "..1749-12-31"            "..1905-12-31?"

# returns a numerical interval of length 2 
unstruwwel(dates, language = "de", scheme = "time-span") %>%
  tibble::as_tibble() %>% dplyr::mutate(id = dplyr::row_number()) %>% 
  tidyr::gather(key = id) %>% tidyr::unnest_wider(value) %>% 
  dplyr::rename_all(dplyr::funs(c("text", "start", "end")))
#> # A tibble: 6 × 3
#>   text                                              start   end
#>   <chr>                                             <dbl> <dbl>
#> 1 letztes Drittel 15. und 1. Hälfte 16. Jahrhundert  1467  1550
#> 2 undatiert                                            NA    NA
#> 3 1460?                                              1460  1460
#> 4 wohl nach 1923                                     1924   Inf
#> 5 spätestens 1750er Jahre                            -Inf  1749
#> 6 1897 (Guss vmtl. vor 1906)                         -Inf  1905

Contributing

Please report issues, feature requests, and questions to the GitHub issue tracker. We have a Contributor Code of Conduct. By participating in unstruwwel you agree to abide by its terms.

About

Python version of the ustruwwel library

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published