Now, let’s jump into the real deal of this article: analyzing the (Q, K, V, O) matrices of the Llama-3-8B-Instruct model through their singular values!
The Code
Let’s first import all the necessary packages for this analysis.
import transformers
import torch
import numpy as np
from transformers import AutoConfig, LlamaModel
from safetensors import safe_open
import os
import matplotlib.pyplot as plt
Then, let’s download the model and save it into our local /tmp directory.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
!huggingface-cli download {MODEL_ID} --quiet --local-dir /tmp/{MODEL_ID}
If you’re GPU-rich, the following code may not be relevant for you. However, if you’re GPU-poor like me, it will be really helpful, because it lets us load only specific layers of the Llama-3-8B model.
def load_specific_layers_safetensors(model, model_name, layer_to_load):
    state_dict = {}
    files = [f for f in os.listdir(model_name) if f.endswith('.safetensors')]
    for file in files:
        filepath = os.path.join(model_name, file)
        with safe_open(filepath, framework="pt") as f:
            for key in f.keys():
                if f"layers.{layer_to_load}." in key:
                    new_key = key.replace(f"model.layers.{layer_to_load}.", 'layers.0.')
                    state_dict[new_key] = f.get_tensor(key)

    missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
    if missing_keys:
        print(f"Missing keys: {missing_keys}")
    if unexpected_keys:
        print(f"Unexpected keys: {unexpected_keys}")
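As a quick usage sketch (assuming the model was downloaded to /tmp/{MODEL_ID} as above, and picking layer 5 purely for illustration), the helper is meant to be combined with a one-layer model skeleton:

# Usage sketch: build a one-layer LlamaModel skeleton and fill it with the weights
# of layer 5, read directly from the downloaded safetensors shards.
model_path = f"/tmp/{MODEL_ID}"
config = AutoConfig.from_pretrained(model_path)
config.num_hidden_layers = 1        # keep only a single decoder layer in memory
model = LlamaModel(config)          # instantiated with the default fp32 weights
load_specific_layers_safetensors(model, model_path, layer_to_load=5)
print(model.layers[0].self_attn.q_proj.weight.shape)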
The reason we do this is that the free tier of the Google Colab GPU is not enough to load Llama-3-8B, even with fp16 precision. Furthermore, this analysis requires us to work in fp32 precision because of how np.linalg.svd is built. Next, we can define the main function to get the singular values for a given matrix_type, layer_number, and head_number.
def get_singular_values(model_path, matrix_type, layer_number, head_number):
    """
    Computes the singular values of the specified matrix in the Llama-3 model.

    Parameters:
        model_path (str): Path to the model
        matrix_type (str): Type of matrix ('q', 'k', 'v', 'o')
        layer_number (int): Layer number (0 to 31)
        head_number (int): Head number (0 to 31)

    Returns:
        np.array: Array of singular values
    """
    assert matrix_type in ['q', 'k', 'v', 'o'], "Invalid matrix type"
    assert 0 <= layer_number < 32, "Invalid layer number"
    assert 0 <= head_number < 32, "Invalid head number"

    # Load the model only for that specific layer since we have limited RAM even after using fp16
    config = AutoConfig.from_pretrained(model_path)
    config.num_hidden_layers = 1
    model = LlamaModel(config)
    load_specific_layers_safetensors(model, model_path, layer_number)

    # Access the specified layer
    # Always index 0 since we have loaded only that specific layer
    layer = model.layers[0]

    # Determine the size of each head
    num_heads = layer.self_attn.num_heads
    head_dim = layer.self_attn.head_dim

    # Access the specified matrix
    weight_matrix = getattr(layer.self_attn, f"{matrix_type}_proj").weight.detach().numpy()

    if matrix_type in ['q', 'o']:
        start = head_number * head_dim
        end = (head_number + 1) * head_dim
    else:  # 'k', 'v' matrices
        # Adjust the head_number based on num_key_value_heads
        # This is done since Llama-3-8B uses Grouped Query Attention
        num_key_value_groups = num_heads // config.num_key_value_heads
        head_number_kv = head_number // num_key_value_groups
        start = head_number_kv * head_dim
        end = (head_number_kv + 1) * head_dim

    # Extract the weights for the specified head
    if matrix_type in ['q', 'k', 'v']:
        weight_matrix = weight_matrix[start:end, :]
    else:  # 'o' matrix
        weight_matrix = weight_matrix[:, start:end]

    # Compute singular values
    singular_values = np.linalg.svd(weight_matrix, compute_uv=False)

    del model, config

    return list(singular_values)
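With the function in place, here is a minimal usage sketch (the matrix type, layer, and head below are arbitrary choices for illustration) that computes and plots the singular values of a single head:

# Usage sketch: singular values of the Q projection of head 0 in layer 0.
model_path = f"/tmp/{MODEL_ID}"
singular_values = get_singular_values(model_path, matrix_type="q", layer_number=0, head_number=0)

plt.plot(singular_values)
plt.title("Singular values of q_proj (layer 0, head 0)")
plt.xlabel("Singular value index")
plt.ylabel("Singular value")
plt.show()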
It’s worth noting that we can extract the weights for the specified head from the Q, K, and V matrices by doing row-wise slicing, thanks to how they are implemented by HuggingFace.
As for the O matrix, we can do column-wise slicing to extract the weights for the specified head thanks to linear algebra! Details can be seen in the following figure.
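To make the slicing concrete, here is a quick shape check, reusing the one-layer model built in the earlier sketch (the expected shapes in the comments assume the published Llama-3-8B configuration: 32 query heads, 8 key/value heads, head_dim = 128, hidden_size = 4096):

# nn.Linear stores its weight as (out_features, in_features), so the per-head
# outputs of Q/K/V live along the rows, while O consumes the concatenated head
# outputs along its columns (its in_features).
layer = model.layers[0]
for name in ["q_proj", "k_proj", "v_proj", "o_proj"]:
    w = getattr(layer.self_attn, name).weight
    print(name, tuple(w.shape))

# Expected output (assuming the standard Llama-3-8B config):
#   q_proj (4096, 4096)  -> 32 heads x 128 rows,   slice rows
#   k_proj (1024, 4096)  -> 8 KV heads x 128 rows, slice rows
#   v_proj (1024, 4096)  -> 8 KV heads x 128 rows, slice rows
#   o_proj (4096, 4096)  -> heads live along the 4096 columns, slice columns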