/* -*- text -*- */

/**@MODULEPAGE "msg" - Message Parser Module

@section msg_meta Module Meta Information

This module contains the parser and functions for manipulating messages and
headers for text-based protocols like SIP, HTTP or RTSP. It also
provides parsing of MIME headers and MIME multipart messages common to
these protocols.

@CONTACT Pekka Pessi <Pekka.Pessi@nokia.com>

@STATUS @SofiaSIP Core library

@LICENSE LGPL

@par Contributor(s):
- Pekka Pessi <Pekka.Pessi@nokia.com>

@section msg_contents Contents of msg Module

The msg module contains the public header files as follows:
- <sofia-sip/msg.h>         base message interfaces
- <sofia-sip/msg_types.h>   message and header struct definitions and typedefs
- <sofia-sip/msg_protos.h>  prototypes of header-specific functions for generic headers
- <sofia-sip/msg_header.h>  function prototypes and macros for manipulating message
                            headers
- <sofia-sip/msg_addr.h>    functions for accessing network addresses and I/O vectors
                            associated with the message
- <sofia-sip/msg_date.h>    types and functions for handling dates and times
- <sofia-sip/msg_mime.h>    types, function prototypes and macros for MIME headers
                            and @ref msg_multipart "multipart messages"
- <sofia-sip/msg_mime_protos.h> prototypes of MIME-header-specific functions

In addition to this interface, the @ref msg_parser "parser documentation"
contains a description of the functionality required when an existing parser
is extended by a new header or a parser is created for a completely new
protocol. It is possible to add new headers to the parser or extend the
definition of existing ones. The header files used for constructing these
parsers are as follows:
- <sofia-sip/msg_parser.h> parsing functions, macros
- <sofia-sip/msg_mclass.h> message factory object definition
- <sofia-sip/msg_mclass_hash.h> hashing of header names

@section msg_overview Parsers, Messages and Headers

The Sofia @b msg module contains the interface to the text-based parsers for
RFC822-like messages, as well as the header and message objects. Currently, there
are three parsers defined: SIP, HTTP, and MIME.

The C structure corresponding to each header is defined either in
<sofia-sip/msg_types.h> or in a protocol-specific header file. These
protocol-specific header files include <sofia-sip/sip.h>, <sofia-sip/http.h>, and
<sofia-sip/msg_mime.h>. For each header, there is defined a @em header @em class
structure, some standard functions, and tags for including them in tag
lists.

As a convention, all the identifiers for SIP headers start with the prefix @c
sip and all the macros with @c SIP. The same holds for HTTP, too: it
uses the prefix @c http. However, the MIME headers
and the functions related to them are defined within the @b msg module and
they use the prefix @c msg. If a SIP or HTTP header uses a structure
defined in <sofia-sip/msg_types.h>, there is a typedef suitable for the particular
protocol. For example, the @b Accept header is defined multiple times:

@code
typedef struct msg_accept_s sip_accept_t;
typedef struct msg_accept_s http_accept_t;
@endcode

For header @e X of protocol @e NS, there are types, functions, macros and
header class as follows:

 - @c ns_X_t is the structure used to store the parsed header,
 - @c ns_hclass_t @c ns_X_class[] contains the @em header @em class
   for header X,
 - @c NS_X_INIT() initializes a static instance of @c ns_X_t,
 - @c ns_X_init() initializes a dynamic instance of @c ns_X_t,
 - @c ns_is_X() tests if a header object is an instance of header X,
 - @c ns_X_make() creates a header X object by decoding the given string,
 - @c ns_X_format() creates a header X object by decoding the given
   @c printf() list,
 - @c ns_X_dup() duplicates (deeply copies) the header X,
 - @c ns_X_copy() copies the header X,
 - @c NSTAG_X() is used to include an instance of @c ns_X_t in a tag list, and
 - @c NSTAG_X_STR() is used to include a string containing the header value
   in a tag list.

The declarations of header tags and the prototypes for these functions can
be imported separately from the type definitions, for instance, the tags
related to SIP headers are declared in the include file
<sofia-sip/sip_tag.h>, and the header-specific functions in
<sofia-sip/sip_header.h>.

@section parser_intro Parsing Text Messages

The Sofia text parser follows the @em recursive-descent principle.  In other words,
it is a program that descends the syntax tree top-down recursively.
(All syntax trees have root at top and they grow downwards.)

In the case of SIP, HTTP and other similar protocols, such a parser is very
efficient. The parser can choose between different forms based on each
token, as the protocol syntax is carefully designed so that it requires only
minimal scan-ahead. It is also easy to extend a recursive-descent parser via
a standard API, unlike, for instance, a LALR parser generated by @em Bison.

The abstract message module @b msg contains a high-level parser engine that
drives the parsing process and invokes the protocol-specific parser for each
header. As there is no low-layer framing between the RFC822-style messages,
the parser considers any received data, be it a UDP datagram or a TCP
stream, as a @em byte @em stream. The protocol-specific parser controls how
a byte stream is split into separate messages or if it consists of a single
message only.

The parser engine works by separating the stream into fragments, then passing
each fragment to a suitable parser. A fragment is a piece of the message that is
parsed during a single step: the first line, each header, the empty line
between headers and message body, the message body. (In the case of HTTP, the
message body can consist of multiple fragments known as chunks.)

The parser starts by separating the first line (e.g., request or status
line) from the byte stream, then passing the line to the suitable parser.
After the first line come the message headers. The parser continues the parsing
process by extracting headers, each on their own line, from the stream and
passing the contents of each header to its parser. The message structure is
populated based on the parsing results. When an empty line - indicating the end
of headers - is encountered, control is passed to the protocol-specific
parser. Protocol-specific functions take care of extracting the possible
message body from the byte stream.
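
The splitting step described above can be sketched in a few lines of
stand-alone C. This is only an illustration, not the Sofia-SIP
implementation: the names @c toy_frag_t, @c toy_kind and @c toy_split() are
hypothetical, the input is assumed to be one complete message held in memory,
and folded (continuation) header lines as well as HTTP chunks are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical fragment kinds; the names are illustrative only.
enum toy_kind { TOY_FIRST_LINE, TOY_HEADER, TOY_SEPARATOR, TOY_BODY };

typedef struct {
  enum toy_kind kind;
  char const   *data;  // start of the fragment within the input buffer
  size_t        len;   // length, including the terminating CRLF if present
} toy_frag_t;

// Split one complete message into fragments without copying any data.
// Returns the number of fragments stored in frags[], or -1 if frags[]
// is too small.
int toy_split(char const *buf, size_t size, toy_frag_t frags[], int max)
{
  size_t pos = 0;
  int n = 0;

  assert(buf != NULL || size == 0);

  while (pos < size) {
    size_t eol = pos, end;
    enum toy_kind kind;

    if (n == max)
      return -1;

    // Find the next CRLF; if there is none, the line runs to end of buffer.
    while (eol + 1 < size && !(buf[eol] == '\r' && buf[eol + 1] == '\n'))
      eol++;
    end = (eol + 1 < size) ? eol + 2 : size;

    if (n == 0)
      kind = TOY_FIRST_LINE;            // request or status line
    else if (eol == pos && end == pos + 2)
      kind = TOY_SEPARATOR;             // empty line ending the headers
    else
      kind = TOY_HEADER;

    frags[n].kind = kind;
    frags[n].data = buf + pos;
    frags[n].len = end - pos;
    n++;
    pos = end;

    // Everything after the separator is treated as a single body fragment.
    if (kind == TOY_SEPARATOR && pos < size) {
      if (n == max)
        return -1;
      frags[n].kind = TOY_BODY;
      frags[n].data = buf + pos;
      frags[n].len = size - pos;
      n++;
      break;
    }
  }

  return n;
}
```

Each fragment only records a pointer and a length into the original buffer,
which mirrors the idea that fragments refer to data within the I/O buffer
rather than owning a copy of it.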

After the parsing process is completed, the message can be given to the upper layers
(typically a protocol state machine). The parser continues processing the
stream and feeding the messages to the protocol engine until the end of the
stream is reached.

@image html sip-parser.gif Separating byte stream to messages
@image latex sip-parser.eps Separating byte stream to messages

When the parsing process has completed, the first line, each header, the
separator and the message body are all in their own fragment structure. The
fragments form a doubly-linked list known as the @e fragment @e chain, as shown in
the above figure. The memory buffers for the message, the fragment chain,
and other message state are held by the generic message type, #msg_t,
defined in <sofia-sip/msg.h>. The internal structure of #msg_t is known only within the @b
msg module and it is opaque to other modules.

The @b msg parser engine also drives the reverse process, invoking the
encoding method of each fragment so that the whole outgoing message can be
encoded properly.

@section msg_header_struct Message Header as a C struct

Just separating headers from each other and from the message body is not
usually enough. When a header contains structured data, the header contents
should be converted to a form that is convenient to use from C programs. For
that purpose, the message parser needs a parsing function specific to each
individual header. This parsing function divides the contents of the header
into semantically meaningful segments and stores the result in the structure
specific to each header.

The parser engine passes the fragment contents to the parsing function after
it has separated the fragment from the rest of the message. The parser
engine selects correct @e header @e class either by implication (in case of
first line), or it searches for the header class from the hash table using
the header name as the hash key. The @e header @e class contains a pointer
to the parsing function. The parser also has special header classes for
headers with errors and for @e unknown headers, that is, headers with a name
that is not recognized by the parser.
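
The header class lookup can be illustrated with a small stand-alone sketch.
It is not the Sofia-SIP hash table: the types @c toy_hclass_t and
@c toy_table_t, the functions @c toy_table_add() and @c toy_table_find(), the
hash function and the table size are all assumptions made for this example.
It only shows the principle of a case-insensitive lookup by header name that
falls back to an @e unknown header class.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define TOY_HASH_SIZE 8   // illustrative; must exceed the number of classes

typedef int toy_parse_f(char const *value);

// Hypothetical, simplified header class: a name and a parsing function.
typedef struct {
  char const  *hc_name;   // header name, NULL for the unknown class
  toy_parse_f *hc_parse;  // parsing function of the header
} toy_hclass_t;

typedef struct {
  toy_hclass_t const *t_slots[TOY_HASH_SIZE];
} toy_table_t;

// Trivial stand-ins for the real header parsing functions.
static int toy_accept_d(char const *v)  { return strchr(v, '/') ? 0 : -1; }
static int toy_cseq_d(char const *v)    { return isdigit((unsigned char)v[0]) ? 0 : -1; }
static int toy_unknown_d(char const *v) { (void)v; return 0; }

toy_hclass_t const toy_accept_class[1]  = {{ "Accept", toy_accept_d }};
toy_hclass_t const toy_cseq_class[1]    = {{ "CSeq", toy_cseq_d }};
toy_hclass_t const toy_unknown_class[1] = {{ NULL, toy_unknown_d }};

// Case-insensitive hash of a header name.
static unsigned toy_hash(char const *name, size_t len)
{
  unsigned h = 0;
  size_t i;
  for (i = 0; i < len; i++)
    h = h * 31 + (unsigned)tolower((unsigned char)name[i]);
  return h % TOY_HASH_SIZE;
}

// Case-insensitive comparison of a class name with name[0..len-1].
static int toy_name_eq(char const *hc_name, char const *name, size_t len)
{
  size_t i;
  if (strlen(hc_name) != len)
    return 0;
  for (i = 0; i < len; i++)
    if (tolower((unsigned char)hc_name[i]) != tolower((unsigned char)name[i]))
      return 0;
  return 1;
}

// Insert a header class using linear probing. Returns -1 if table is full.
int toy_table_add(toy_table_t *t, toy_hclass_t const *hc)
{
  unsigned i, h = toy_hash(hc->hc_name, strlen(hc->hc_name));
  for (i = 0; i < TOY_HASH_SIZE; i++) {
    unsigned j = (h + i) % TOY_HASH_SIZE;
    if (t->t_slots[j] == NULL) {
      t->t_slots[j] = hc;
      return 0;
    }
  }
  return -1;
}

// Find the class for a header name; unrecognized names get the unknown class.
toy_hclass_t const *toy_table_find(toy_table_t const *t,
                                   char const *name, size_t len)
{
  unsigned i, h = toy_hash(name, len);
  for (i = 0; i < TOY_HASH_SIZE; i++) {
    toy_hclass_t const *hc = t->t_slots[(h + i) % TOY_HASH_SIZE];
    if (hc == NULL)
      break;
    if (toy_name_eq(hc->hc_name, name, len))
      return hc;
  }
  return toy_unknown_class;
}
```

Because an unrecognized name still yields a valid class, the caller can always
invoke @c hc_parse without a separate check for missing entries.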

For instance, the Accept header has the following syntax:
@code
   Accept         = "Accept" ":" #( media-range [ accept-params ] )

   media-range    = ( "*" "/" "*"
                    | ( type "/" "*" )
                    | ( type "/" subtype ) ) *( ";" parameter )

   accept-params  = ";" "q" "=" qvalue *( accept-extension )

   accept-extension = ";" token [ "=" ( token | quoted-string ) ]
@endcode

When an Accept header is parsed, the header parser function (msg_accept_d())
separates the @e type, @e subtype, and each parameter in the list into
strings. The parsing result is assigned to a #msg_accept_t structure, which is
defined as follows:

@code
typedef struct msg_accept_s
{
  msg_common_t        ac_common[1]; //< Common fragment info
  msg_accept_t       *ac_next;      //< Pointer to next Accept header
  char const         *ac_type;      //< Pointer to type/subtype
  char const         *ac_subtype;   //< Points after first slash in type
  msg_param_t const  *ac_params;    //< List of parameters
  msg_param_t         ac_q;         //< Value of q parameter
}
msg_accept_t;
@endcode

The string containing the @e type is put into the @c ac_type field, the @e
subtype after the slash can be found in the @c ac_subtype field, and the
list of @e accept-params (together with media-specific parameters) is put in
the @c ac_params array. If there is a @e q parameter present, a pointer to
the @c qvalue is assigned to the @c ac_q field.

At the beginning of the header structure there are two boilerplate members.
The @c ac_common[1] contains information common to all message fragments.
The @c ac_next is a pointer to the next header field with the same name, in case
a message contains multiple @b Accept headers or multiple comma-separated
header fields are located in a single line.
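
To make the field layout concrete, the following stand-alone sketch splits a
single Accept item in place. It is not msg_accept_d(): the structure
@c toy_accept_t, the function @c toy_accept_d() and the limit
@c TOY_MAX_PARAMS are hypothetical, the boilerplate members are left out, and
the input is assumed to be one item without surrounding whitespace, quoted
strings or commas.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_PARAMS 8   // illustrative limit, not from Sofia-SIP

// Hypothetical, simplified counterpart of the structure shown above.
typedef struct {
  char const *ac_type;                        // points to "type/subtype"
  char const *ac_subtype;                     // points after the slash
  char const *ac_params[TOY_MAX_PARAMS + 1];  // NULL-terminated list
  char const *ac_q;                           // value of q, or NULL
} toy_accept_t;

// Parse one Accept item, e.g. "text/html;level=1;q=0.8", in place.
// The buffer s is modified: each ';' is overwritten with a NUL.
// Returns 0 on success and -1 on error.
int toy_accept_d(toy_accept_t *ac, char *s)
{
  char *slash, *semi;
  int n = 0;

  assert(ac != NULL && s != NULL);

  ac->ac_type = ac->ac_subtype = ac->ac_q = NULL;
  ac->ac_params[0] = NULL;

  // Terminate the media range so that the slash is searched only there.
  semi = strchr(s, ';');
  if (semi)
    *semi++ = '\0';

  slash = strchr(s, '/');
  if (slash == NULL || slash == s || slash[1] == '\0')
    return -1;                     // both type and subtype are required

  ac->ac_type = s;
  ac->ac_subtype = slash + 1;

  while (semi) {
    char *param = semi;

    semi = strchr(param, ';');
    if (semi)
      *semi++ = '\0';

    if (n == TOY_MAX_PARAMS)
      return -1;
    ac->ac_params[n++] = param;
    ac->ac_params[n] = NULL;       // keep the list terminated

    if (strncmp(param, "q=", 2) == 0)
      ac->ac_q = param + 2;        // points at the qvalue
  }

  return 0;
}
```

As in the structure above, @c ac_type keeps pointing at the whole
"type/subtype" string while @c ac_subtype points into the same string, so no
extra memory is needed for the subtype.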

@section msg_object_example Representing a Message as a C struct

It is not enough to represent a message as a list of headers following each
other. The programmer also needs a convenient way to access certain headers
at the message level, for example, accessing the @b Accept header directly
instead of going through all headers and examining their names. The
structured view of the message is provided via a message-specific C struct.
In general, its type is msg_pub_t (it provides the public view of the message). The
protocol-specific type is #sip_t, #http_t or #msg_multipart_t for
SIP, HTTP and MIME, respectively.

So, a single message is represented by two objects: the first object (#msg_t) is
private to the @b msg module and opaque to the application programmer, the second
(#sip_t, #http_t or #msg_multipart_t) is a public protocol-specific
structure accessible by all.

@note The application programmer can obtain a pointer to the
protocol-specific structure from a #msg_t object using the msg_public()
function. The msg_public() function takes a protocol tag, a well-known identifier, as
its argument. The SIP, HTTP and MIME modules already define wrappers around
msg_public(); for example, a #sip_t structure can be obtained with the
sip_object() function (or macro).

As an example, the #sip_t structure is defined as follows:
@code
typedef struct sip_s {
  msg_common_t        sip_common[1];    // Used with recursive inclusion
  msg_pub_t          *sip_next;         // Ditto
  void               *sip_user;	        // Application data
  unsigned            sip_size;         // Size of the structure with
                                        // extension headers
  int                 sip_flags;        // Parser flags

  sip_error_t        *sip_error;	// Erroneous headers

  sip_request_t      *sip_request;      // Request line
  sip_status_t       *sip_status;       // Status line

  sip_via_t          *sip_via;          // Via (v)
  sip_route_t        *sip_route;        // Route
  sip_record_route_t *sip_record_route; // Record-Route
  sip_max_forwards_t *sip_max_forwards; // Max-Forwards
  ...
} sip_t;
@endcode

As you can see above, the public #sip_t structure contains the common
header members that are also found in the beginning of a header
structure. The @e sip_size indicates the size of the structure - the
application can extend the parser and #sip_t structure beyond the
original size. The @e sip_flags contains various flags used during the
parsing and printing process. They are documented in <sofia-sip/msg.h>. These
boilerplate members are followed by the pointers to various message
elements and headers.

@section msg_parsing_example Result of Parsing Process

Let us now show how a simple message is parsed and presented to the
applications. As an example, we choose a SIP request message with method BYE,
including only the mandatory fields:
@code
BYE sip:joe@example.com SIP/2.0
Via: SIP/2.0/UDP sip.example.edu;branch=d7f2e89c.74a72681
Via: SIP/2.0/UDP pc104.example.edu:1030;maddr=110.213.33.19
From: Bobby Brown <sip:bb@example.edu>;tag=77241a86
To: Joe User <sip:joe@example.com>;tag=7c6276c1
Call-ID: 4c4e911b@pc104.example.edu
CSeq: 2 BYE
@endcode

The figure below shows the layout of the BYE message above after parsing:

@image html sip-parser2.gif BYE message and its representation in C
@image latex sip-parser2.eps BYE message and its representation in C

The leftmost box represents the message of type #msg_t.  The next box from
the left represents the #sip_t structure, which contains pointers to
header objects.  The next column contains the header objects.  There is
one header object for each message fragment. The rightmost box represents
the I/O buffer used when the message was received.  Note that the I/O
buffer may be non-contiguous and composed of many separate memory areas.

The message object has a link to the public message structure (@a
m_object), to the doubly-linked fragment chain (@a m_frags) and to the I/O
buffer (@a m_buffer).  The public message structure contains pointers to
the headers according to their type.  If there are multiple headers of
the same type (like the two Via headers in the above message), the
headers are put into a singly-linked list.

Each fragment has pointers to the succeeding and preceding fragments. It also
contains a pointer to the corresponding data within the I/O buffer and its
length.

The main purpose of the fragment chain is to preserve the original order
of the headers.  If there were a third Via header after CSeq in the
message, the fragment representing it would be after the CSeq header in
the fragment chain but after the second Via in the header list.
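
The difference between the two orderings can be shown with a stand-alone
sketch. The types @c toy_header_t and @c toy_msg_t and the function
@c toy_append() are hypothetical and much simpler than the real structures;
they only assume that each header object is linked both into a doubly-linked
fragment chain and into a singly-linked list of headers of the same type.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_header_s toy_header_t;

// Hypothetical header object: member names are illustrative only.
struct toy_header_s {
  toy_header_t *h_succ;  // succeeding fragment in the fragment chain
  toy_header_t *h_pred;  // preceding fragment in the fragment chain
  toy_header_t *h_next;  // next header of the same type
};

// Hypothetical message: the chain plus per-type header pointers.
typedef struct {
  toy_header_t *m_chain; // first fragment
  toy_header_t *m_tail;  // last fragment
  toy_header_t *m_via;   // first Via header
  toy_header_t *m_cseq;  // CSeq header
} toy_msg_t;

// Append header h to the end of the fragment chain of m and to the end
// of the per-type list whose head is *list.
void toy_append(toy_msg_t *m, toy_header_t **list, toy_header_t *h)
{
  assert(m != NULL && list != NULL && h != NULL);

  // Fragment chain: preserves the order in which headers arrived.
  h->h_succ = NULL;
  h->h_pred = m->m_tail;
  if (m->m_tail)
    m->m_tail->h_succ = h;
  else
    m->m_chain = h;
  m->m_tail = h;

  // Header list: groups headers of the same type together.
  h->h_next = NULL;
  while (*list)
    list = &(*list)->h_next;
  *list = h;
}
```

Appending two Via headers, a CSeq header and then a third Via header gives a
fragment chain Via, Via, CSeq, Via, while the list starting from @c m_via
contains the three Via headers in order.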

@section msg_parsing_memory Example: Parsing a Complete Message

The following code fragment is an example of parsing a complete message. The
parsing process is more involved when a stream is parsed.

@code
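#include <assert.h>
#include <string.h>
#include <sofia-sip/msg.h>
#include <sofia-sip/msg_addr.h>

msg_t *parse_memory(msg_mclass_t const *mclass, char const data[], int len)
{
  msg_t *msg;
  int m;
  size_t n = (size_t)len;       // number of bytes to be parsed
  msg_iovec_t iovec[2] = {{ 0 }};

  msg  = msg_create(mclass, 0);
  if (!msg)
    return NULL;

  // Reserve buffer space for exactly n bytes
  m = msg_recv_iovec(msg, iovec, 2, n, 1);
  if (m < 0) {
    msg_destroy(msg);
    return NULL;
  }
  assert(m >= 1 && m <= 2);
  assert(iovec[0].mv_len + iovec[1].mv_len == n);

  // Copy data to the first area and the rest, if any, to the second area
  memcpy(iovec[0].mv_base, data, n = iovec[0].mv_len);
  if (m == 2)
    memcpy(iovec[1].mv_base, data + n, iovec[1].mv_len);

  // Commit the copied data and indicate end of stream
  msg_recv_commit(msg, iovec[0].mv_len + iovec[1].mv_len, 1);

  m = msg_extract(msg);
  assert(m != 0);
  if (m < 0) {
     msg_destroy(msg);
     return NULL;
  }
  return msg;
}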
@endcode

Let's go through this simple function step by step. The function gets the @a
data pointer and its size in bytes, @a len, as its arguments. We first initialize
an I/O vector used to pass the message data to the parser.

@code
msg_t *parse_memory(msg_mclass_t const *mclass, char const data[], int len)
{
  msg_t *msg;
  int m;
  size_t n = (size_t)len;       // number of bytes to be parsed
  msg_iovec_t iovec[2] = {{ 0 }};
@endcode

The message class @a mclass (a parser driver object, #msg_mclass_t) is used
to represent a particular protocol-specific parser instance. When a message
object is created, it is given as an argument to msg_create() function:

@code
  msg  = msg_create(mclass, 0);
  if (!msg)
    return NULL;
@endcode

Next we obtain a memory buffer for data with msg_recv_iovec(). The memory
buffer is usually a single contiguous memory area, but in some cases it may
consist of two distinct areas. Therefore the @a iovec is used here to pass
the buffers around. The @a iovec is also very handy as it can be directly
passed to various system I/O calls.

@code
  m = msg_recv_iovec(msg, iovec, 2, n, 1);
  if (m < 0) {
    msg_destroy(msg);
    return NULL;
  }
@endcode

These assumptions always hold when you call msg_recv_iovec() for the first
time with a complete message:

@code
  assert(m >= 1 && m <= 2);
  assert(iovec[0].mv_len + iovec[1].mv_len == n);
@endcode

Next, we copy the data to the I/O vector and commit the copied data to the
message. Earlier with msg_recv_iovec() we allocated buffer space for data,
now calling msg_recv_commit() indicates that valid data has been copied to
the buffer. The last parameter to msg_recv_commit() indicates that the end
of stream is encountered and no more data is to be expected.

@code
  memcpy(iovec[0].mv_base, data, n = iovec[0].mv_len);
  if (m == 2)
    memcpy(iovec[1].mv_base, data + n, iovec[1].mv_len);

  msg_recv_commit(msg, iovec[0].mv_len + iovec[1].mv_len, 1);
@endcode

We call msg_extract() next; it takes care of parsing the message. A fatal
parsing error is indicated by returning -1. If the message is incomplete,
msg_extract() returns 0. When a complete message has been parsed, a positive
value is returned. We know that a message cannot be incomplete, as a call to
msg_recv_commit() indicated to the parser that the end-of-stream has been
encountered.

@code
  m = msg_extract(msg);
  assert(m != 0);
  if (m < 0) {
     msg_destroy(msg);
     return NULL;
  }
  return msg;
}
@endcode

 */

/**@class msg_s msg.h
 *
 * @brief Message object.
 *
 * The message object is used by Sofia parsers for SIP and HTTP
 * protocols. The message object has an abstract, protocol-independent
 * interface type #msg_t, and a separate public protocol-specific interface
 * #msg_pub_t (which is typedef'ed to #sip_t or #http_t depending
 * on the protocol).
 *
 * The main interface to abstract messages is defined in <sofia-sip/msg.h>. The
 * network I/O interface used by transport protocols is defined in
 * <sofia-sip/msg_addr.h>. The protocol-specific parser table, also known as message
 * class, is defined in <sofia-sip/msg_mclass.h>. (The message class is used as a
 * factory object when a message object is created with msg_create()).
 */

/**@typedef typedef struct msg_s msg_t;
 *
 * Message object.
 *
 * The @a msg_t is the type of a message object used by Sofia signaling
 * protocols and parsers. Its contents are not directly accessible.
 */

/**@typedef typedef struct msg_common_s msg_common_t;
 *
 * Common part of header.
 *
 * The @a msg_common_t is the base type of message headers used by
 * protocol parsers. Instead of #msg_common_t, most interfaces use
 * #msg_header_t, which is supposed to be a union of all possible headers.
 */


/**
 * @defgroup msg_parser Parser Building Blocks
 *
 * This submodule contains the functions and types for building a
 * protocol-specific parser.
 */

/**
 * @defgroup msg_headers Headers
 *
 * This submodule contains the functions and types for handling message
 * headers and other elements.
 */


/**
 * @defgroup msg_mime MIME Headers
 *
 * This submodule contains the header classes, functions and types for
 * handling MIME headers (@RFC2045) and MIME multipart (@RFC2046) processing.
 *
 * The MIME headers implemented are as follows:
 * - @ref msg_accept "@b Accept header"
 * - @ref msg_accept_charset "@b Accept-Charset header"
 * - @ref msg_accept_encoding "@b Accept-Encoding header"
 * - @ref msg_accept_language "@b Accept-Language header"
 * - @ref msg_content_disposition "@b Content-Disposition header"
 * - @ref msg_content_encoding "@b Content-Encoding header"
 * - @ref msg_content_id "@b Content-ID header"
 * - @ref msg_content_location "@b Content-Location header"
 * - @ref msg_content_language "@b Content-Language header"
 * - @ref msg_content_md5 "@b Content-MD5 header"
 * - @ref msg_content_transfer_encoding "@b Content-Transfer-Encoding header"
 * - @ref msg_mime_version "@b MIME-Version header"
 */

/**
 * @defgroup test_msg Testing Parser
 *
 * This submodule contains the functions and types for building
 * parser objects for testing purposes.
 */