When serving a TensorFlow model from Python, most people will call the following Python API:

sess.run(...) 

This is the absolute slowest way to invoke a TensorFlow prediction.  

Calling the native C++ TensorFlow engine from a higher-level language like Python adds significant overhead to every call.

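In this case, sess.run() incurs approximately 120 microseconds of overhead per call. This adds up quickly: at 1,000,000 calls per second, it amounts to 120 seconds of overhead for every second of traffic, which has to be spread across many parallel workers.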
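To put that figure in perspective, here is a back-of-the-envelope calculation sketched in Python. The helper name overhead_budget is made up for illustration; the 120 microseconds and 1,000,000 calls per second are the numbers quoted above, and the estimate ignores the time the model itself takes to compute.

```python
import math

def overhead_budget(per_call_us, calls_per_second):
    """Estimate the cost of a fixed per-call overhead.

    Returns a tuple of:
      - seconds of overhead accumulated per wall-clock second,
      - the maximum calls per second one worker could make if the
        overhead were its only cost,
      - the minimum number of workers needed to absorb the overhead alone.
    """
    overhead_s = per_call_us * calls_per_second / 1_000_000
    max_calls_per_worker = 1_000_000 / per_call_us
    workers = math.ceil(overhead_s)
    return overhead_s, max_calls_per_worker, workers

if __name__ == "__main__":
    overhead_s, per_worker, workers = overhead_budget(120, 1_000_000)
    print("overhead per second of traffic: %.1f s" % overhead_s)
    print("ceiling per worker: %.0f calls/s" % per_worker)
    print("workers needed for overhead alone: %d" % workers)
```

In other words, a single worker paying 120 microseconds per call cannot exceed roughly 8,300 calls per second, no matter how fast the model is.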

A brute-force approach is to bypass the Python API (session.py) altogether and call the native API (TF_Run) directly when running a forward propagation through the network, that is, when invoking a prediction.

Below is some sample TensorFlow code showing how to call the native TF_Run function directly from Python.

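# These are private TensorFlow 1.x internals, imported the way session.py does.
# Assumption: the module paths below may differ between TensorFlow releases.
from tensorflow.python import pywrap_tensorflow as tf_session
from tensorflow.python.framework import errors

fetch_list0 = [b'MatMul_2:0']  # names of the tensors to fetch
feed_dict0 = {}                # inputs to feed (none here)
target_list0 = []              # ops to run without fetching a result
options0 = None
run_metadata0 = None

# `session` must be the native session handle that session.py passes to TF_Run,
# not the tf.Session object itself.
# The with block supplies the status object and raises if TF_Run reports an error.
with errors.raise_exception_on_not_ok_status() as status0:
    result2 = tf_session.TF_Run(session, options0, feed_dict0, fetch_list0,
                                target_list0, status0, run_metadata0)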
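Before committing to the direct call, it is worth measuring both paths on your own model. The following is a minimal timing sketch, not part of TensorFlow: the helper mean_latency_us and the two stand-in functions are hypothetical names, and the stand-ins exist only so the sketch runs without TensorFlow installed.

```python
import time

def mean_latency_us(fn, iterations=10000):
    """Call fn() repeatedly and return the mean wall-clock time per call in microseconds."""
    fn()  # warm-up call, not timed
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1000000

# Stand-in callables (assumptions for illustration only).
def python_api_path():
    return sum(range(200))

def direct_path():
    return 1 + 1

if __name__ == "__main__":
    print("python API path: %.2f us/call" % mean_latency_us(python_api_path))
    print("direct path:     %.2f us/call" % mean_latency_us(direct_path))
```

To compare the real code paths, replace the stand-ins with a function that calls sess.run(...) and one that wraps the TF_Run call shown above, keeping the fetches and feeds identical in both.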