API reference

This chapter contains detailed API documentation for HappyBase. Read the user guide first to get a general idea of how HappyBase works.

The HappyBase API is organised as follows:

Connection:
The Connection class is the main entry point for application developers. It connects to the HBase Thrift server and provides methods for table management.
Table:
The Table class is the main class for interacting with data in tables. This class offers methods for data retrieval and data manipulation. Instances of this class can be obtained using the Connection.table() method.
Batch:
The Batch class implements the batch API for data manipulation, and is available through the Table.batch() method.
ConnectionPool:
The ConnectionPool class implements a thread-safe connection pool that allows an application to (re)use multiple connections.

Connection

class happybase.Connection(host='localhost', port=9090, timeout=None, autoconnect=True, table_prefix=None, table_prefix_separator='_', compat='0.96', transport='buffered')

Connection to an HBase Thrift server.

The host and port arguments specify the host name and TCP port of the HBase Thrift server to connect to. If omitted or None, a connection to the default port on localhost is made. If specified, the timeout argument sets the socket timeout in milliseconds.

If autoconnect is True (the default), the connection is opened immediately; otherwise Connection.open() must be called explicitly before first use.

The optional table_prefix and table_prefix_separator arguments specify a prefix and a separator string to be prepended to all table names, e.g. when Connection.table() is invoked. For example, if table_prefix is myproject, all tables will have names like myproject_XYZ.

The optional compat argument sets the compatibility level for this connection. Older HBase versions have slightly different Thrift interfaces, and using the wrong protocol can lead to crashes caused by communication errors, so make sure to use the correct one. This value must be one of the strings 0.90, 0.92, 0.94, or 0.96 (the default).

The optional transport argument specifies the Thrift transport mode to use. Supported values for this argument are buffered (the default) and framed. Make sure to choose the right one, since otherwise you might see non-obvious connection errors or program hangs when making a connection. HBase versions before 0.94 always use the buffered transport. Starting with HBase 0.94, the Thrift server optionally uses a framed transport, depending on the argument passed to the hbase-daemon.sh start thrift command. The default -threadpool mode uses the buffered transport; the -hsha, -nonblocking, and -threadedselector modes use the framed transport.

New in version 0.5: timeout argument

New in version 0.4: table_prefix_separator argument

New in version 0.4: support for framed Thrift transports

Parameters:
  • host (str) – The host to connect to
  • port (int) – The port to connect to
  • timeout (int) – The socket timeout in milliseconds (optional)
  • autoconnect (bool) – Whether the connection should be opened immediately
  • table_prefix (str) – Prefix used to construct table names (optional)
  • table_prefix_separator (str) – Separator used for table_prefix
  • compat (str) – Compatibility mode (optional)
  • transport (str) – Thrift transport mode (optional)
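
The transport rules described above can be sketched as a small lookup. This is only an illustration: the helper names transport_for_mode and open_connection are hypothetical and not part of HappyBase, and the mapping simply restates the server modes listed in the transport description. The happybase import is deferred so that the mapping can be inspected without the package or a running Thrift server.

```python
# Thrift server start mode -> HappyBase transport (HBase 0.94 and later).
# These helper names are illustrative; they are not part of HappyBase.
_TRANSPORT_BY_MODE = {
    'threadpool': 'buffered',
    'hsha': 'framed',
    'nonblocking': 'framed',
    'threadedselector': 'framed',
}


def transport_for_mode(mode):
    """Return the transport matching a Thrift server start mode."""
    try:
        # Accept both 'hsha' and the command-line spelling '-hsha'.
        return _TRANSPORT_BY_MODE[mode.lstrip('-')]
    except KeyError:
        raise ValueError('unknown Thrift server mode: %r' % (mode,))


def open_connection(host, mode='threadpool', **kwargs):
    """Open a Connection with the transport that fits the server mode."""
    import happybase  # deferred import; requires the happybase package
    transport = transport_for_mode(mode)
    return happybase.Connection(host, transport=transport, **kwargs)
```

With a server started in -hsha mode, open_connection('localhost', mode='-hsha') would then request the framed transport.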
close()

Close the underlying transport to the HBase instance.

This method closes the underlying Thrift transport (TCP connection).

compact_table(name, major=False)

Compact the specified table.

Parameters:
  • name (str) – The table name
  • major (bool) – Whether to perform a major compaction.
create_table(name, families)

Create a table.

Parameters:
  • name (str) – The table name
  • families (dict) – The name and options for each column family

The families argument is a dictionary mapping column family names to a dictionary containing the options for this column family, e.g.

families = {
    'cf1': dict(max_versions=10),
    'cf2': dict(max_versions=1, block_cache_enabled=False),
    'cf3': dict(),  # use defaults
}
connection.create_table('mytable', families)

These options correspond to the ColumnDescriptor structure in the Thrift API, but note that the names should be provided in Python style, not in camel case notation, e.g. time_to_live, not timeToLive. The following options are supported:

  • max_versions (int)
  • compression (str)
  • in_memory (bool)
  • bloom_filter_type (str)
  • bloom_filter_vector_size (int)
  • bloom_filter_nb_hashes (int)
  • block_cache_enabled (bool)
  • time_to_live (int)
delete_table(name, disable=False)

Delete the specified table.

New in version 0.5: disable argument

In HBase, a table always needs to be disabled before it can be deleted. If the disable argument is True, this method first disables the table if it is not already disabled, and then deletes it.

Parameters:
  • name (str) – The table name
  • disable (bool) – Whether to first disable the table if needed
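
As a hedged sketch of how delete_table() and tables() might be combined, the hypothetical helper below (not part of HappyBase) deletes a table only when the connection lists it. It assumes table names may come back either as text or as byte strings, so it normalises them before comparing.

```python
def drop_table_if_exists(connection, name):
    """Delete `name` if the connection lists it; return True if deleted."""
    # Assumption: tables() may yield byte strings on some setups,
    # so decode those before comparing against `name`.
    existing = [
        t.decode('utf-8') if isinstance(t, bytes) else t
        for t in connection.tables()
    ]
    if name not in existing:
        return False
    # disable=True lets delete_table() disable the table first if needed.
    connection.delete_table(name, disable=True)
    return True
```

Calling drop_table_if_exists(connection, 'mytable') before create_table() gives a simple way to start from an empty table in test setups.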
disable_table(name)

Disable the specified table.

Parameters: name (str) – The table name
enable_table(name)

Enable the specified table.

Parameters: name (str) – The table name
is_table_enabled(name)

Return whether the specified table is enabled.

Parameters: name (str) – The table name
Returns: whether the table is enabled
Return type: bool
open()

Open the underlying transport to the HBase instance.

This method opens the underlying Thrift transport (TCP connection).

table(name, use_prefix=True)

Return a table object.

Returns a happybase.Table instance for the table named name. This does not result in a round-trip to the server, and the table is not checked for existence.

The optional use_prefix argument specifies whether the table prefix (if any) is prepended to the specified name. Set this to False if you want to use a table that resides in another ‘prefix namespace’, e.g. a table from a ‘friendly’ application co-hosted on the same HBase instance. See the table_prefix argument to the Connection constructor for more information.

Parameters:
  • name (str) – the name of the table
  • use_prefix (bool) – whether to use the table prefix (if any)
Returns:

Table instance

Return type:

Table
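
The naming rule behind table_prefix and use_prefix can be illustrated with a small stand-alone function. This is a sketch of the rule as described here, not HappyBase's own code; full_table_name is a hypothetical name.

```python
def full_table_name(name, table_prefix=None, table_prefix_separator='_',
                    use_prefix=True):
    """Illustrate how a prefixed table name is put together."""
    if table_prefix is None or not use_prefix:
        # No prefix configured, or the caller opted out with use_prefix=False.
        return name
    return table_prefix + table_prefix_separator + name
```

With table_prefix set to myproject, the name XYZ maps to myproject_XYZ, while passing use_prefix=False leaves XYZ unchanged.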

tables()

Return a list of table names available in this HBase instance.

If a table_prefix was set for this Connection, only tables that have the specified prefix will be listed.

Returns: The table names
Return type: List of strings

Table

class happybase.Table(name, connection)

HBase table abstraction class.

This class cannot be instantiated directly; use Connection.table() instead.

batch(timestamp=None, batch_size=None, transaction=False, wal=True)

Create a new batch operation for this table.

This method returns a new Batch instance that can be used for mass data manipulation. The timestamp argument applies to all puts and deletes on the batch.

If given, the batch_size argument specifies the maximum batch size after which the batch should send the mutations to the server. By default this is unbounded.

The transaction argument specifies whether the returned Batch instance should act in a transaction-like manner when used as context manager in a with block of code. The transaction flag cannot be used in combination with batch_size.

The wal argument determines whether mutations should be written to the HBase Write Ahead Log (WAL). This flag can only be used with recent HBase versions. If specified, it provides a default for all the put and delete operations on this batch. This default value can be overridden for individual operations using the wal argument to Batch.put() and Batch.delete().

New in version 0.7: wal argument

Parameters:
  • transaction (bool) – whether this batch should behave like a transaction (only useful when used as a context manager)
  • batch_size (int) – batch size (optional)
  • timestamp (int) – timestamp (optional)
  • wal (bool) – whether to write to the WAL (optional)
Returns:

Batch instance

Return type:

Batch
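
A minimal sketch of batch usage as a context manager follows. The helper name store_rows is hypothetical; it only relies on Table.batch() and Batch.put() as documented in this chapter.

```python
def store_rows(table, rows, batch_size=1000):
    """Write (row_key, data) pairs through one batch; return the count."""
    count = 0
    # With batch_size set, mutations are sent whenever that many have
    # accumulated; leaving the with block sends whatever is still pending.
    with table.batch(batch_size=batch_size) as batch:
        for row_key, data in rows:
            batch.put(row_key, data)
            count += 1
    return count
```

Because transaction cannot be combined with batch_size, a transaction-like variant would call table.batch(transaction=True) without a batch size instead.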

cells(row, column, versions=None, timestamp=None, include_timestamp=False)

Retrieve multiple versions of a single cell from the table.

This method retrieves multiple versions of a cell (if any).

The versions argument defines how many cell versions to retrieve at most.

The timestamp and include_timestamp arguments behave exactly the same as for row().

Parameters:
  • row (str) – the row key
  • column (str) – the column name
  • versions (int) – the maximum number of versions to retrieve
  • timestamp (int) – timestamp (optional)
  • include_timestamp (bool) – whether timestamps are returned
Returns:

cell values

Return type:

list of values
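
As an illustration of cells() with include_timestamp, the hypothetical helper below returns the stored versions of one cell ordered from oldest to newest. It sorts explicitly by timestamp, so it makes no assumption about the order in which the server returns versions.

```python
def value_history(table, row_key, column, versions=5):
    """Return up to `versions` (value, timestamp) pairs, oldest first."""
    cells = table.cells(row_key, column, versions=versions,
                        include_timestamp=True)
    # Each cell is a (value, timestamp) tuple; sort on the timestamp.
    return sorted(cells, key=lambda cell: cell[1])
```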

counter_dec(row, column, value=1)

Atomically decrement (or increment) a counter column.

This method is a shortcut for calling Table.counter_inc() with the value negated.

Returns: counter value after decrementing
Return type: int
counter_get(row, column)

Retrieve the current value of a counter column.

This method retrieves the current value of a counter column. If the counter column does not exist, this function initialises it to 0.

Note that application code should never store an incremented or decremented counter value directly; use the atomic Table.counter_inc() and Table.counter_dec() methods for that.

Parameters:
  • row (str) – the row key
  • column (str) – the column name
Returns:

counter value

Return type:

int

counter_inc(row, column, value=1)

Atomically increment (or decrement) a counter column.

This method atomically increments or decrements a counter column in the row specified by row. The value argument specifies how much the counter should be incremented (for positive values) or decremented (for negative values). If the counter column did not exist, it is automatically initialised to 0 before incrementing it.

Parameters:
  • row (str) – the row key
  • column (str) – the column name
  • value (int) – the amount to increment or decrement by (optional)
Returns:

counter value after incrementing

Return type:

int
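
The counter methods might be wrapped as in the sketch below. The helper names and the stats:hits column are hypothetical; the point is that the application never reads a value, changes it, and writes it back, but leaves the arithmetic to the atomic calls.

```python
HITS_COLUMN = b'stats:hits'  # hypothetical column used for illustration


def record_hit(table, row_key, column=HITS_COLUMN):
    """Atomically add one hit and return the new counter value."""
    return table.counter_inc(row_key, column)


def undo_hits(table, row_key, count, column=HITS_COLUMN):
    """Atomically subtract `count` hits and return the new value."""
    # Equivalent to counter_inc() with the value negated.
    return table.counter_dec(row_key, column, value=count)
```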

counter_set(row, column, value=0)

Set a counter column to a specific value.

This method stores a 64-bit signed integer value in the specified column.

Note that application code should never store an incremented or decremented counter value directly; use the atomic Table.counter_inc() and Table.counter_dec() methods for that.

Parameters:
  • row (str) – the row key
  • column (str) – the column name
  • value (int) – the counter value to set
delete(row, columns=None, timestamp=None, wal=True)

Delete data from the table.

This method deletes all columns for the row specified by row, or only some columns if the columns argument is specified.

Note that, in many situations, batch() is a more appropriate method to manipulate data.

New in version 0.7: wal argument

Parameters:
  • row (str) – the row key
  • columns (list_or_tuple) – list of columns (optional)
  • timestamp (int) – timestamp (optional)
  • wal (bool) – whether to write to the WAL (optional)
families()

Retrieve the column families for this table.

Returns: Mapping from column family name to settings dict
Return type: dict
put(row, data, timestamp=None, wal=True)

Store data in the table.

This method stores the data in the data argument for the row specified by row. The data argument is a dictionary that maps columns to values. Column names must include a family and qualifier part, e.g. cf:col, though the qualifier part may be the empty string, e.g. cf:.

Note that, in many situations, batch() is a more appropriate method to manipulate data.

New in version 0.7: wal argument

Parameters:
  • row (str) – the row key
  • data (dict) – the data to store
  • timestamp (int) – timestamp (optional)
  • wal (bool) – whether to write to the WAL (optional)
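
The column naming rule for put() can be checked before data is sent. The helper split_column below is hypothetical and not part of HappyBase; it is a sketch of the family:qualifier convention that accepts both text and byte strings.

```python
def split_column(name):
    """Split 'family:qualifier' into (family, qualifier)."""
    colon = b':' if isinstance(name, bytes) else ':'
    family, found, qualifier = name.partition(colon)
    if not found or not family:
        raise ValueError(
            'column name needs a family and a colon: %r' % (name,))
    # The qualifier may legitimately be empty, as in 'cf:'.
    return family, qualifier
```

A caller could run split_column() over the keys of the data dictionary and only then call table.put(row, data).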
regions()

Retrieve the regions for this table.

Returns: regions for this table
Return type: list of dicts
row(row, columns=None, timestamp=None, include_timestamp=False)

Retrieve a single row of data.

This method retrieves the row with the row key specified in the row argument and returns the columns and values for this row as a dictionary.

The row argument is the row key of the row. If the columns argument is specified, only the values for these columns will be returned instead of all available columns. The columns argument should be a list or tuple containing strings. Each name can be a column family, such as cf1 or cf1: (the trailing colon is not required), or a column family with a qualifier, such as cf1:col1.

If specified, the timestamp argument specifies the maximum version that results may have. The include_timestamp argument specifies whether cells are returned as single values or as (value, timestamp) tuples.

Parameters:
  • row (str) – the row key
  • columns (list_or_tuple) – list of columns (optional)
  • timestamp (int) – timestamp (optional)
  • include_timestamp (bool) – whether timestamps are returned
Returns:

Mapping of columns (both qualifier and family) to values

Return type:

dict
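
A short sketch of row() with include_timestamp: the hypothetical helper newest_column returns the name of the most recently written column in a row, relying on cells being returned as (value, timestamp) tuples.

```python
def newest_column(table, row_key, columns=None):
    """Return the column with the highest timestamp, or None if no data."""
    cells = table.row(row_key, columns=columns, include_timestamp=True)
    if not cells:
        return None
    # cells maps column name -> (value, timestamp).
    return max(cells, key=lambda column: cells[column][1])
```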

rows(rows, columns=None, timestamp=None, include_timestamp=False)

Retrieve multiple rows of data.

This method retrieves the rows with the row keys specified in the rows argument, which should be a list (or tuple) of row keys. The return value is a list of (row_key, row_dict) tuples.

The columns, timestamp and include_timestamp arguments behave exactly the same as for row().

Parameters:
  • rows (list) – list of row keys
  • columns (list_or_tuple) – list of columns (optional)
  • timestamp (int) – timestamp (optional)
  • include_timestamp (bool) – whether timestamps are returned
Returns:

List of (row_key, row_dict) tuples

Return type:

list of tuples
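
Since rows() yields (row_key, row_dict) tuples, the result converts directly into a dictionary. The helpers below are hypothetical; missing_keys additionally assumes that row keys without data are simply absent from the result, which is not stated in this chapter.

```python
def rows_as_dict(table, row_keys, columns=None):
    """Return {row_key: row_dict} for the requested keys."""
    return dict(table.rows(row_keys, columns=columns))


def missing_keys(table, row_keys):
    """Return the requested keys for which no row came back."""
    # Assumption: rows() omits keys that have no data in the table.
    found = rows_as_dict(table, row_keys)
    return [key for key in row_keys if key not in found]
```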

scan(row_start=None, row_stop=None, row_prefix=None, columns=None, filter=None, timestamp=None, include_timestamp=False, batch_size=1000, scan_batching=None, limit=None, sorted_columns=False)

Create a scanner for data in the table.

This method returns an iterable that can be used for looping over the matching rows. Scanners can be created in two ways:

  • The row_start and row_stop arguments specify the row keys where the scanner should start and stop. It does not matter whether the table contains any rows with the specified keys: the first row at or after row_start will be the first result, and the last row before row_stop will be the last result. Note that the start of the range is inclusive, while the end is exclusive.

    Both row_start and row_stop can be None to specify the start and the end of the table respectively. If both are omitted, a full table scan is done. Note that this usually results in severe performance problems.

  • Alternatively, if row_prefix is specified, only rows with row keys matching the prefix will be returned. If given, row_start and row_stop cannot be used.

The columns, timestamp and include_timestamp arguments behave exactly the same as for row().

The filter argument may be a filter string that will be applied at the server by the region servers.

If limit is given, at most limit results will be returned.

The batch_size argument specifies how many results should be retrieved per batch when retrieving results from the scanner. Only set this to a low value (or even 1) if your data is large, since a low batch size results in added round-trips to the server.

The optional scan_batching is for advanced usage only; it translates to Scan.setBatching() at the Java side (inside the Thrift server). By setting this value, rows may be split into partial rows, so result rows may be incomplete, and the number of results returned by the scanner may no longer correspond to the number of rows matched by the scan.

If sorted_columns is True, the columns in the rows returned by this scanner will be retrieved in sorted order, and the data will be stored in OrderedDict instances.

Compatibility notes:

  • The filter argument is only available when using HBase 0.92 (or up). In HBase 0.90 compatibility mode, specifying a filter raises an exception.
  • The sorted_columns argument is only available when using HBase 0.96 (or up).

New in version 0.8: sorted_columns argument

New in version 0.8: scan_batching argument

Parameters:
  • row_start (str) – the row key to start at (inclusive)
  • row_stop (str) – the row key to stop at (exclusive)
  • row_prefix (str) – a prefix of the row key that must match
  • columns (list_or_tuple) – list of columns (optional)
  • filter (str) – a filter string (optional)
  • timestamp (int) – timestamp (optional)
  • include_timestamp (bool) – whether timestamps are returned
  • batch_size (int) – batch size for retrieving results
  • scan_batching (int) – server-side scan batching (optional)
  • limit (int) – max number of rows to return
  • sorted_columns (bool) – whether to return sorted columns
Returns:

generator yielding the rows matching the scan

Return type:

iterable of (row_key, row_data) tuples
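
The two ways of bounding a scan can be sketched as follows. Both helper names are hypothetical. in_scan_range only models the inclusive start and exclusive stop rule locally, using Python's byte-wise comparison of row keys, and keys_with_prefix shows a prefix scan that collects row keys.

```python
def in_scan_range(row_key, row_start=None, row_stop=None):
    """Model which keys a row_start/row_stop scan would cover."""
    if row_start is not None and row_key < row_start:
        return False  # before the inclusive start
    if row_stop is not None and row_key >= row_stop:
        return False  # at or past the exclusive stop
    return True


def keys_with_prefix(table, prefix, limit=None):
    """Return the row keys starting with `prefix`, in scan order."""
    # scan() yields (row_key, row_data) tuples; only the keys are kept.
    return [key for key, _data in table.scan(row_prefix=prefix, limit=limit)]
```

For example, keys_with_prefix(table, b'user:', limit=10) returns at most ten keys, all beginning with user:.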

Batch

class happybase.Batch(table, timestamp=None, batch_size=None, transaction=False, wal=True)

Batch mutation class.

This class cannot be instantiated directly; use Table.batch() instead.

delete(row, columns=None, wal=None)

Delete data from the table.

See Table.delete() for a description of the row, columns, and wal arguments. The wal argument should normally not be used; its only use is to override the batch-wide value passed to Table.batch().

put(row, data, wal=None)

Store data in the table.

See Table.put() for a description of the row, data, and wal arguments. The wal argument should normally not be used; its only use is to override the batch-wide value passed to Table.batch().

send()

Send the batch to the server.
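
When a batch is not used as a context manager, send() has to be called explicitly. The sketch below uses a hypothetical helper, apply_mutations, which also shows the per-operation wal override for a chosen set of rows.

```python
def apply_mutations(table, puts=(), deletes=(), unlogged_rows=()):
    """Queue puts and deletes on one batch, then send it explicitly."""
    batch = table.batch()
    for row_key, data in puts:
        if row_key in unlogged_rows:
            # Override the batch-wide WAL setting for this put only.
            batch.put(row_key, data, wal=False)
        else:
            batch.put(row_key, data)
    for row_key in deletes:
        batch.delete(row_key)
    # Nothing reaches the server until send() is called.
    batch.send()
    return batch
```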

Connection pool

class happybase.ConnectionPool(size, **kwargs)

Thread-safe connection pool.

New in version 0.5.

The size argument specifies how many connections this pool manages. Additional keyword arguments are passed unmodified to the happybase.Connection constructor, with the exception of the autoconnect argument, since maintaining connections is the task of the pool.

Parameters:
  • size (int) – the maximum number of concurrently open connections
  • kwargs – keyword arguments passed to happybase.Connection
connection(*args, **kwds)

Obtain a connection from the pool.

This method must be used as a context manager, i.e. with Python’s with block. Example:

with pool.connection() as connection:
    pass  # do something with the connection

If timeout is specified, this is the number of seconds to wait for a connection to become available before NoConnectionsAvailable is raised. If omitted, this method waits forever for a connection to become available.

Parameters: timeout (int) – number of seconds to wait (optional)
Returns: active connection from the pool
Return type: happybase.Connection
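
A minimal sketch of pool usage with a timeout follows. The helper name tables_via_pool is hypothetical; it accepts any pool object that offers the connection() context manager described here.

```python
def tables_via_pool(pool, timeout=None):
    """Borrow a connection from the pool and list the table names."""
    # The connection goes back to the pool when the with block exits.
    with pool.connection(timeout=timeout) as connection:
        return connection.tables()
```

With a real pool, for example one created as happybase.ConnectionPool(size=3, host='localhost'), a call such as tables_via_pool(pool, timeout=5) raises NoConnectionsAvailable if no connection becomes free within five seconds.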
class happybase.NoConnectionsAvailable

Exception raised when no connections are available.

This happens if a timeout was specified when obtaining a connection, and no connection became available within the specified timeout.

New in version 0.5.