layout | title |
---|---|
presentation |
Miscellaneous slides |
The previous example is a simple way to handle errors.
- we'll learn a better way when we cover exceptions later in the course
--
Some general rules about error handling:
- If your code is making assumptions about its inputs, check them!
- for example, if your code assumes a parameter is positive, check and alert the user if this is not the case (and exit if appropriate)
- this will save you time later on!
--
- If you can reasonably handle the error, do so, but warn the user about it
--
- Most of the time, the error will be fatal (not recoverable), in which case your program should not proceed
- issue an informative message about the problem, and exit with a non-zero exit code
--
- try to detect errors as early as possible
- for example: don't start a long batch of processing without first checking that you can create the output file
name: type_alias
We have already started using a vector of strings to store the fragments – and it does seem to provide all the functionality we need at this point
--
We could type std::vector<std::string>
as the type for the fragment list
wherever this is needed
- but it is a bit long...
- it is not obvious what its purpose is to outside observers
- what if we decide to use a different type to store the fragments at a later stage? We'd need to modify our code in lots of different places
--
This is an example where defining a type alias might be a good idea:
using Fragments = std::vector<std::string>;
Now we can use Fragments
as the type for our list of fragments, and the
compiler will understand that to mean std::vector<std::string>
- note this is not a new type - merely an alias for an existing type
What does this look like in our project?
...
*using Fragments = std::vector<std::string>;
`Fragments` load_fragments (const std::string& filename)
{
...
...
void fragment_statistics (const `Fragments`& fragments)
{
...
--
.explain-bottom[ Exercise: modify your code as shown here ]
name: structured_binding
class: section
The C++17 standard introduced the ability to use structured binding
- this allows us to declare and initialise multiple variables from the individual members of a larger composite variable
--
This is best illustrated by example, using our find_biggest_overlap()
function:
`auto overlap` = find_biggest_overlap (sequence, fragments);
std::cerr << std::format ("overlap of size {} at index {}\n",
`overlap.size`, `overlap.fragment`);
--
Using structured binding, we can declare independent variables for each
member of the Overlap
structure returned:
auto [ overlap, index ] = find_biggest_overlap (sequence, fragments);
std::cerr << std::format ("overlap of size {} at index {}\n", `overlap`, `index`);
--
.explain-bottom[ Exercise: modify your own code to use structured binding as shown here ]
class: info
The C++ STL provides a convenience template class that can be used as a
lightweight replacement for a dedicated struct
in a variety
of contexts: std::tuple
One use for std::tuple
is for structured binding:
#include <tuple>
std::tuple<int,int> find_biggest_overlap (const std::string& sequence,
Fragments& fragments)
{
...
return { biggest_overlap, fragment_with_biggest_overlap };
}
...
auto [ overlap, index ] = find_biggest_overlap (sequence, fragments);
class: info
While appealing in its simplicity, std::tuple
does not provide any indication
about the interpretation or meaning of each member
- in our case, all we know is that we get 2
int
s - which one represents the overlap size, and which the index?
- We have to rely on additional documentation to work this out
- this could be described in a comment, or some other external document
--
On the other hand, using a struct
provides named members, whose interpretation is
immediately much clearer.
⇒ Better to use a struct
with named members rather than a std::tuple
where possible
--
The std::tuple
class is very useful in other contexts, especially when
working with generic functions and containers
- a generic function or container will not know ahead of time what it is to contain
std::tuple
provides a simple container for an arbitrary number of variables of arbitrary typesstd::tuple
is therefore used in many places throughout the C++ STL
Structured binding offers another way to return multiple values from a function
- this can sometimes be more convenient or more legible
--
There are many other uses for structured binding
- many of these are too advanced for this course
- indeed, arguably some of the most useful were only introduced in C++23!
- if interested, have a look online for examples
name: string_view
class: section
Let's take a look at our code so far, and try to reason about where most of the time is spent
--
The bulk of our algorithm consists of a series of nested loops:
• loop while there are fragments left to merge
• loop over all candidate fragments
• loop over all overlap sizes
• extract the corresponding substrings and compare
• merge the fragment with the largest overlap
• remove it from the list of candidate fragments
--
When looking for opportunities to improve performance, we should focus our attention on the innermost loop
- this is the portion of code that will be executed most often
--
⇒ what is going on in the innermost loop?
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
...
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = sequence.substr(0, overlap);
const auto frag_end = fragment.substr(fragment.size()-overlap);
if (seq_start == frag_end) {
largest_overlap = overlap;
break;
}
}
...
-- We actually have three loops, although we did not write them up ourselves!
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
...
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = `sequence.substr(0, overlap)`;
const auto frag_end = `fragment.substr(fragment.size()-overlap)`;
if (seq_start == frag_end) {
largest_overlap = overlap;
break;
}
}
...
We actually have three loops, although we did not write them up ourselves!
- one for extracting each substring
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
...
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = sequence.substr(0, overlap);
const auto frag_end = fragment.substr(fragment.size()-overlap);
if (`seq_start == frag_end`) {
largest_overlap = overlap;
break;
}
}
...
We actually have three loops, although we did not write them up ourselves!
- one for extracting each substring
- one for the comparison itself
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
...
for (int overlap = fragment.size(); overlap > 0; --overlap) {
`const auto seq_start` = sequence.substr(0, overlap);
`const auto frag_end` = fragment.substr(fragment.size()-overlap);
if (seq_start == frag_end) {
largest_overlap = overlap;
break;
}
}
...
We actually have three loops, although we did not write them up ourselves!
- one for extracting each substring
- one for the comparison itself
We also have some implicit memory allocation & deallocation for each substring
- we will cover this much later in the course
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
...
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = sequence.substr(0, overlap);
const auto frag_end = fragment.substr(fragment.size()-overlap);
if (seq_start == frag_end) {
largest_overlap = overlap;
break;
}
}
...
- there is not much that can be done about the comparison itself
- thankfully, it will tend to terminate early (as soon as any character doesn't match) --
- but can we avoid copying each substring into a whole new
std::string
to perform the comparison?
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
...
for (int overlap = fragment.size(); overlap > 0; --overlap) {
bool match = true;
for (int n = 0; n < overlap; ++n) {
if (sequence[n] != fragment[fragment.size()-overlap+n]) {
match = false;
break;
}
}
if (match) {
largest_overlap = overlap;
break;
}
}
...
We can perform the comparison manually character-by-character, ensuring we compare the characters at the correct positions
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
...
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = sequence.substr(0, overlap);
const auto frag_end = fragment.substr(fragment.size()-overlap);
if (seq_start == frag_end) {
largest_overlap = overlap;
break;
}
}
...
... or we can use the std::string_view
class to achieve the same effect!
#include <string_view>
...
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
...
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = `std::string_view (sequence)`.substr(0, overlap);
const auto frag_end = `std::string_view (fragment)`.substr(fragment.size()-overlap);
if (seq_start == frag_end) {
largest_overlap = overlap;
break;
}
}
...
... or we can use the std::string_view
class to achieve the same effect!
How does this work?
Each std::string
holds information consisting of:
- the size of the string
- a handle to the memory location corresponding to the start of the text
- this is actually a pointer – which we are trying to avoid as much as possible on this course!
↓
(*#&\:;*'¬|\^$(&*^$%("()!`this is the text that our string contains`@~@:~#*(£';~"£*(£|#@
-- Memory is essentially a very long list of 8-bit bytes
- most of these are going to contain data in machine representation (binary)
- none of that will make sense as text
- the text for our string will occupy some contiguous section of memory
- the
std::string
class only needs to know where it starts and how long it is
- the
--
The std::string_view
class holds exactly the same information
The std::string_view
class provides a non-owning (and non-modifiable) view
of a string
Notably, it provides its own substr()
method, which also returns a
std::string_view
- with a different start and size for the substring
- no copying is performed!
--
Original std::string
:
- start at position 1000, size 41 bytes
↓
(*#&\:;*'¬|\^$(&*^$%("()!`this is the text that our string contains`@~@:~#*(£';~"£*(£|#@
The std::string_view
class provides a non-owning (and non-modifiable) view
of a string
Notably, it provides its own substr()
method, which also returns a
std::string_view
- with a different start and size for the substring
- no copying is performed!
Original std::string
:
- start at position 1000, size 41 bytes
↓
(*#&\:;*'¬|\^$(&*^$%("()!this is the text that `our string` contains@~@:~#*(£';~"£*(£|#@
↑
std::string_view
for "our string"
:
- start at position 1022, size 10 bytes
class: section
The main difference between std::string
and std::string_view
is one of
ownership
--
Each std::string
'owns' the memory where its text is
stored, and is responsible for its management
- it needs to allocate the memory when some text needs to be stored
- it needs to free or deallocate that memory when no longer in use, so that the memory can be used for other purposes
--
Each std::string_view
only provides a view into some existing text in
memory
- it does not own that memory, and is not responsible for managing it
--
Consequently, std::string_view
instances are very cheap to create and destroy
- no memory management!
- no copying!
--
But we need to be very careful with the lifetime of that memory location...
When extracting a substring using the original std::string
, we got a new
std::string
- this new
std::string
contained a fully self-contained copy of the corresponding text - its lifetime was not tied to the lifetime of the original
std::string
--
But when we extract a substring using the new std::string_view
, we
get a non-owning view
- the text in that view is completely dependent on the original
std::string
- its lifetime is intrisincally tied to that of the original
std::string
--
There is nothing to prevent the use of a std::string_view
that is no longer
valid!
- no mechanism exists in C++ to prevent this
- if the original
std::string
no longer exists, or points to new location, we will get undefined behaviour- the code may crash, we could get nonsense back, etc.
-- It is up to the developer (you!) to ensure the lifetime of the text exceeds the lifetime of the view
- this is a very common source of bugs and security vulnerabilities
std::string text ("this is the text that our string contains");
std::cout << text << "\n";
auto view = std::string_view (text).substr (22, 10);
std::cout << view << "\n";
text.clear(); // ⇐ original string is no longer valid!
std::cout << view << "\n";
--
If you try this, you may find that the code works just fine
- this is because the memory where the text was stored hasn't been modified
- that memory location has been marked as unused and is available for other uses
- ... but we haven't needed to store anything else there – yet!
--
This is why the result of the last line of code is undefined behaviour
- it may work as expected sometimes, or it may crash other times
- it may depend on what compiler is used, or what settings were used for the compiler, etc.
std::string text ("this is the text that our string contains");
std::cout << text << "\n";
auto view = std::string_view (text).substr (22, 10);
std::cout << view << "\n";
text = "a longer piece of text that won't fit in the original memory location");
// ⇑ forces re-allocation: original string is no longer valid!
std::cout << view << "\n";
--
Assigning a different string to our variable can also force a re-allocation
- particularly if the new string is longer than the original
- there is no guarantee that the new string will be allocated at the same location!
std::string_view get_substr ()
{
std::string original = read_from_file();
return std::string_view (original);
}
// ⇓ original string has gone out of scope and no longer exists!
std::cout << get_substr() << "\n";
--
Returning a non-owning view or reference to a function scope, local variable is a common mistake!
- this applies to any non-owning reference or similar construct
// a vector of 16 integers:
std::vector<int> v = (16, 0);
// we take an iterator to the first element:
auto iter = v.begin();
// we take a reference to the second element:
int& ref (v[1]);
// we append another element to the vector:
v.push_back (1);
// ⇑ this may cause a re-allocation of the contents of the vector
// to a different memory location!
This issue is much broader than just std::string_view
- it affects any constructs that refers to data owned or managed by some other construct
- it can affect references and STL iterators too!
These are examples of a more general class of problems, generally referred to as dangling references
- you will also come across 'dangling pointers', which is essentially the same problem
This applies to references, iterators, pointers and any non-owning construct
--
It is not always obvious when such references become invalidated
- the original owner has been destroyed or gone out of scope
- the original owner has been cleared of its contents
- the original owner has moved its contents to a different location
--
For these reasons, use non-owning constructs with caution!
- in general, they are safe as function arguments
- the owning object lives in a broader scope with a longer lifetime, and will not change while the function runs
- they are not safe as return values
- especially when referring to function scope variables!
- they are not safe when there is a possibility of modification of the original container
`#include <string_view>`
...
int compute_overlap (const std::string& sequence, const std::string& fragment)
{
...
for (int overlap = fragment.size(); overlap > 0; --overlap) {
const auto seq_start = `std::string_view (sequence)`.substr(0, overlap);
const auto frag_end = `std::string_view (fragment)`.substr(fragment.size()-overlap);
if (seq_start == frag_end) {
largest_overlap = overlap;
break;
}
}
...
--
In our case, we can see that:
- the lifetime of both
std::string_view
s is shorter than the originalsequence
&fragments
- there is no scope for our code to modify either
sequence
orfragment
while our views are live- indeed, we enforced this by providing these to
compute_overlap()
asconst
references!
- indeed, we enforced this by providing these to
We can measure the impact of this simple change using the time
command:
Before the change:
$ time ./shotgun ../data/fragments-1.txt out
initial sequence has size 1000
final sequence has length 20108
`real 0m0.359s`
user 0m0.359s
sys 0m0.000s
We can measure the impact of this simple change using the time
command:
After the change:
$ time ./shotgun ../data/fragments-1.txt out
initial sequence has size 1000
final sequence has length 20108
`real 0m0.041s`
user 0m0.037s
sys 0m0.004s
--
This is almost 10× faster!