2008/03/25

Socket Overview

A TCP/IP socket is a bi-directional stream between two communicating processes.

 Server                Client
--------              --------
 socket                socket
 bind
 listen
   accept   ---------  connect
   | recv*  <--------  send*
   | send*  -------->  recv*
   | close  ---------  close
 close
socket
Gets socket resources (file descriptor, buffers).
Specifies the protocol family.
bind
Gives the socket a local address (port, IP address).
listen
Establishes that this socket will be a server socket (a/k/a a welcome socket) for accepting incoming connections.
Sets the length of the listen queue (the maximum number of waiting incoming connections).
accept
Accepts an incoming connection.
Returns a new socket connected to the client.
connect
Connects a client socket to a remote host.
send / recv
Sends and receives data over the connected socket stream.
close
Closes the socket.

See: Beej's Guide to Network Programming (especially sections 4 and 5).

2008/03/18

Files in C

A file is a flat representation of data as an ordered sequence of bytes.

Files are written to and read from as a stream, but a file handle can have the concept of a file cursor. The file cursor represents the position in the file of the "next" byte to be read or written.

In Java, the use of the file cursor for random access to a file is discouraged (seeking is fairly expensive, and often leads to errors). Most file I/O happens strictly through the stream abstraction. In C, it's a lot easier to seek backwards and forwards. There are functions to move the file cursor to different locations in the file, but reading and writing still happen through the stream abstraction. When you read or write a byte, the file cursor is automatically advanced one byte.

C's I/O libraries are set up to be type agnostic so that you can read or write any type to/from the stream without excessive overloading (overloading is actually not allowed in C). They work by giving the I/O function a pointer to a buffer of unknown type (a void*) and telling the function how large that buffer is and how many elements it contains.

Reading and writing a single integer value looks like this:

int val;
fread(&val, sizeof(val), 1, f_in);
fwrite(&val, sizeof(val), 1, f_out);

Reading and writing a struct looks like this:

struct AStructType val;
fread(&val, sizeof(val), 1, f_in);
fwrite(&val, sizeof(val), 1, f_out);

Reading and writing an array looks like this:

int val[10];
fread(val, sizeof(int), 10, f_in);
fwrite(val, sizeof(int), 10, f_out);


Because the read and write functions take pointers to buffers, you cannot pass a temporary value directly to a stream.

fwrite(atoi("7"), sizeof(int), 1, f);

The above example will generate a compile warning and segfault because the value returned from atoi() is the value 7 and not a pointer to the value 7. The compiler warns us that we are trying to convert an integer to a pointer without a cast, and it segfaults because fwrite tries to read what is stored at memory location 7 (which is not a valid pointer address).

fwrite(&atoi("7"), sizeof(int), 1, f);

The above example does not compile because we are trying to take the address of a temporary value. This value may not have an address at all, it may only exist in a register, so we cannot take its address.

int val = atoi("7");
fwrite(&val, sizeof(int), 1, f);

This corrects the problem and works as intended.

See: stdio.h (cppreference.com, cplusplus.com)

Stream Abstraction

A common abstraction used for I/O is a "stream." A stream is a reliable connection between two components through which one device can send data to the other. Streams usually operate on bytes (and are called byte streams).

A stream gets its name because it is like a river of data. The data length is unknown, and you cannot randomly access elements in the stream. When data is added to the stream, it flows down the river. You are only allowed to insert data at the start of the stream -- you cannot change it, remove it, or insert something ahead of it -- once it has gone into the stream, it is out of your control. When you remove data from a stream, you can only remove it from the end. You cannot "go back" in the stream (doing so would be the same as pushing a river backwards) and you cannot read a value later in the stream (without reading the data earlier in the stream). The data pours out of the end, and once it is removed, it is no longer part of the stream.

Imagine a stream as a river, and your program is using it to send/receive messages. Putting data in the stream is the same as putting a message into a bottle and floating it down the river. Taking data from the stream is the same as removing a bottle from the river. However, in this case the river is a perfect delivery mechanism. Bottles always arrive in the same order that they were sent, they are never lost (or missed), and there are no gaps between the bottles.

An output stream is occasionally called a "sink," whereas an input stream is sometimes called a "source."

Streams are a useful abstraction because they hide the underlying transport mechanism used between components, and they hide the details of the component at the other end. The stream could be a network socket, a pipe to another program, a terminal, a keyboard, a disk, a printer, a modem, a billboard, or a buffer in memory. Programs written to use streams as their I/O can be easily adapted to substitute one implementation of a stream for another.

See: Java's InputStream, OutputStream, C++'s iostream, and C's I/O with streams (stdio.h).

2008/03/06

Using GDB

GDB (project, documentation) is the GNU debugger, a command-line debugging tool for GNU systems.

You can Google around for lots of information about it. A short tutorial, a longer tutorial, and a reference card are pretty easy to find.

For GDB to be most effective, you should compile your code with the -g flag to leave debugging symbols in your executable. This will do things like leave variable names and line numbers in the executable so that GDB can show you what/where things are.

The commands that you will use the most are:

help
r arguments (run)
r 5
file executable
bt (backtrace)
p expression (print)
p argv[1]
b file:symbol (breakpoint)
b main.c:5
b main.c:parseArgs
d (delete breakpoints)
s (step) [step-into]
n (next) [step-over]
c (continue)
up
q (quit)

Segmentation Faults and Bus Errors

A segmentation fault occurs when your program tries to access a region of memory in a way that is not permitted. A bus error occurs when your program tries to access a region of memory that does not exist, or it tries to access an unaligned region. See the wiki pages for examples.

Segfaults and bus errors are runtime errors that come from accessing memory through pointers. Java has protected you from these kinds of errors by abstracting pointers away from the programming language. The only similar thing in Java is the ArrayIndexOutOfBoundsException that is raised when you try to access an element that is not contained in an array. In C, buffer accesses are unchecked, so an out-of-bounds access may or may not raise a segmentation fault depending on how memory has been laid out by the compiler and OS. [Note: unchecked buffers are arguably the largest source of security bugs in C/C++.]

Consider the code:
int buff[3];
buff[3] = 8;

In Java, this would throw an ArrayIndexOutOfBounds. In C, it may or may not segfault.

If the buffer buff was placed in memory next to a memory page boundary, it has a greater chance of raising a segfault. The buffer buff ends at buff[2]; or rather, buff has been reserved space for 3 elements. If buff[3] (or more precisely, *(buff+3), the 4th element) was on a read-only page, then the assignment will generate a segfault (but a read would not). If it was on a page marked "no access," then either an attempt to read or to write would raise an error. If it was on a page that is marked as read/write, then no fault signal will be raised and the value 8 will overwrite whatever value was previously at that location. This may be empty memory (so nothing bad will happen), another variable (the old variable contents will be lost), part of the call stack (possibly causing the program to "return" to an arbitrary location), or even a piece of instruction code. If the value 8 came from user input, this bug could allow a user to execute arbitrary code within the program.

In Java, these errors are always fail-stop, so they are easier to find. In C, they may or may not be fail-stop depending on how memory is arranged, so they are harder to find.

Bonus Material: The libraries Electric Fence and Dmalloc change the way in which pages are allocated to ensure that each buffer (in the heap) ends at a page boundary and that the following page is "no access." The tool Valgrind translates and instruments code to look for similar (and more) errors, and DieHard (acm, draft) takes a probabilistic approach for laying out memory to detect these errors.

2008/03/04

Bonus: Automated Testing

One of the principles of agile development is that your code should have completely automated unit tests. Your tests should require no interaction on your part, and they should require no interpretation; they should simply run and say pass or fail. If your tests require input, then they are not repeatable. If they require interpretation, then they are not automated.

By building an automated regression suite, you can use the tests as an enabler of change, letting you test every change that you make to your code-base with little effort. Ideally, you would make running the tests part of your build process. In most agile environments, the automated execution of tests is at least part of the integration process (when you commit your code changes to the repository).

There are many testing tools available for agile developers. The most popular is the xUnit framework. (Java: JUnit, .NET: NUnit, C++: CppUnit). These tools may be too heavyweight to learn just for testing your cs213 assignment, so we're going to see how to write an automated testing framework using some more basic primitives. The C language includes a basic assertion package <assert.h>.

Consider the file:
testable.c:
#include <stdbool.h>
#include <assert.h>

bool isBiggerThanFive(const int x) {
    return x > 5;
}

int main() {

    assert(true == isBiggerThanFive(10));
    assert(true == isBiggerThanFive(6));
    assert(false == isBiggerThanFive(5));
    assert(false == isBiggerThanFive(4));
    assert(false == isBiggerThanFive(-10));

    return 0;
}

Compiling and running this file with gcc -Wall testable.c -o testable; ./testable produces no output since all of the tests pass.

[Note: an assertion is a statement that something is believed to be true, so all of your assertions should pass.]

If we add the incorrect assertion that assert(false == isBiggerThanFive(50)); to the bottom of our method, we get the output:

testable: testable.c:15: main: Assertion `0 == isBiggerThanFive(50)' failed.

This is telling us that for the executable testable, in the file testable.c, on line 15, in method main, our assertion that false == isBiggerThanFive(50) failed. In this case, it's because our test is invalid, but it could have just as easily been an error in our implementation.

If we add another incorrect assertion to our test method true == isBiggerThanFive(0) and re-build, we still get the same output. There is no output from the second test, even though it should fail. This is the expected behavior.

In unit testing, it's desired that a test harness fails on the first error for each test. There is no sense in continuing to run a test after a failure since it means that the unit under test (or the test itself) has entered an unknown state, and therefore we can no longer make assertions (our belief was wrong).

Failing an assert in C causes the program to abort execution. This has the unfortunate effect that only one unit test can fail per executable (subsequent tests will not run if there was a failure). The xUnit testing frameworks handle testing multiple units better, but assert is good enough for testing small programs.

If you don't want to use the assert library, you can write your test harnesses to use the printf function, but all that you should be printing is pass or fail [details]; don't print things that require you to read them to verify that they are correct, as this will destroy any chance of test automation.

See also:

2008/02/26

Make

A makefile consists of a set of make rules. Each rule has three parts: a target, a set of zero or more dependencies, and a list of zero or more commands.

target: dependencies
(tab) command1
(tab) command2

Make is invoked by specifying a target on the command line. If no target is given, then the first target in the file is taken as the default. To complete a target, all dependencies are evaluated first. If a dependency is another target in the makefile, then those dependent targets are evaluated. If the dependency is not a target in the makefile, then the file system is checked for the existence of a file with the name of the dependency. If any dependency has been modified more recently than the file matching the target (or no file matching the target exists), then the commands for that target are run.

Consider the files:
main.c:
  #include "bar.h"
  int main() {
    bar();
    return 0;
  }
bar.c:
  #include "bar.h"
  void bar() {
  }
bar.h:
  void bar();
The corresponding naive makefile would be:
main.exe: main.o bar.o
(tab) gcc main.o bar.o -o main.exe

main.o: main.c bar.h
(tab) gcc -c main.c

bar.o: bar.c bar.h
(tab) gcc -c bar.c

clean:
(tab) rm *.o main.exe

Note: the "-o" flag specifies the name of the output file. The "-c" flag tells the compiler to stop after the "compile" stage and before the "link" stage, generating an object file as the output.

Typing "make" on the command line would build this project and produce an executable named "main.exe". Typing "make" again would do nothing since all of the dependencies would evaluate to false since none of the source files had changed. Changing any of the source files and then typing "make" would cause only the required segments of the project to be recompiled. The make utility knows what files need to be rebuilt based on the dependencies that we have given it, so it is important that these dependencies are correct.

See also: GNU make documentation.