
2023-09-06


by Adam Goldschmidt


Upgrade your Python skills: Examining the Dictionary

A hash table (hash map) is a data structure that implements an associative array abstract data type, a structure that can map keys to values.

If it smells like a Python dict, feels like a dict, and looks like one… well, it must be a dict. Absolutely! Oh, and a set too...

Huh?

Dictionaries and sets in Python are implemented using a hash table. It may sound daunting at first, but as we investigate further everything should be clear.


Objective

Throughout this article, we will discover how a dict is implemented in Python, and we will build our own (simple) implementation of one. The article is divided into three parts, and building our custom dictionary takes place in the first two:

Understanding what hash tables are and how to use them
Diving into Python’s source code to better understand how dictionaries are implemented
Exploring differences between the dictionary and other data structures such as lists and sets

What is a hash table?

A hash table is a structure that is designed to store a list of key-value pairs, without compromising on speed and efficiency of manipulating and searching the structure.


The effectiveness of the hash table comes from the hash function, a function that computes the index of the key-value pair. This means we can quickly insert, search and remove elements, since we know their index in the memory array.

The complexity begins when two of our keys hash to the same value. This scenario is called a hash collision. There are many different ways of handling a collision, but we will only cover Python’s way. We won’t go too deep with our hash table explanation for the sake of keeping this article beginner-friendly and Python-focused.


Let’s make sure we have wrapped our heads around the concept of hash tables before moving on. We will start by creating the skeleton for our very (very) simple custom dict, consisting of only insertion and search methods, using some of Python's dunder methods. We will need to initialise the hash table with a list of a specific size, and enable subscription (the [] syntax) for it:
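
A minimal sketch of such a skeleton might look like this. The class name HashTable, the slots attribute and the default size are assumptions made for illustration, and the two dunder methods are deliberately left unimplemented at this stage:

```python
class HashTable:
    """Skeleton of a very simple dict-like table (names are illustrative)."""

    def __init__(self, size=10):
        # A fixed-size list of slots; None marks an empty slot.
        self.size = size
        self.slots = [None] * size

    def __setitem__(self, key, value):
        # Called for: table[key] = value
        raise NotImplementedError("insertion is filled in later")

    def __getitem__(self, key):
        # Called for: table[key]
        raise NotImplementedError("search is filled in later")
```

Defining __setitem__ and __getitem__ is what lets us write table[key] = value and table[key] on our own class.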

Now, our hash table list needs to hold specific structures, each one containing a key, a value and a hash:

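
One way to sketch such a structure is a small class with three attributes. The name Entry is the one used later in this article; the constructor signature and the __repr__ are assumptions added here for readability:

```python
class Entry:
    """One slot of the table: a key, a value and the hash of the key."""

    def __init__(self, key, value, hash_value):
        self.key = key
        self.value = value
        self.hash = hash_value

    def __repr__(self):
        return f"Entry(key={self.key!r}, value={self.value!r}, hash={self.hash})"


print(Entry("Adam", 7, 4))  # Entry(key='Adam', value=7, hash=4)
```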

Basic Example

A small company with 10 employees wants to keep records of its employees' remaining sick days. We can use the following hash function, so everything can fit in the memory array:

length of the employee's name % TABLE_SIZE

Let’s define our hash function in the Entry class:

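
A possible version of that, assuming a module-level TABLE_SIZE constant and a static method named hash_key (both names are illustrative), could be:

```python
TABLE_SIZE = 10  # assumed size for this sketch: one slot per employee


class Entry:
    """One record of the table: a key, a value and a cached hash."""

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.hash = self.hash_key(key)

    @staticmethod
    def hash_key(key):
        # Toy hash function from the text: length of the name modulo the table size.
        return len(key) % TABLE_SIZE


entry = Entry("Adam", 7)
print(entry.key, entry.value, entry.hash)  # Adam 7 4
```

This is a deliberately weak hash function: every name of the same length gets the same hash, which makes collisions easy to trigger.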

Now we can initialise a 10-element array in our table:

Wait! Let’s think it over. We will most probably run into some hash collisions. If we only have 10 elements, it will be much harder for us to find an open space after a collision. Let’s decide that our table is going to have double the size: 20 elements! It will come in handy in the future, I promise.

To quickly insert each employee, we will follow the logic:


array[length of the employee's name % 20] = employee_remaining_sick_days

So our insertion method will look like the following (no hash collision handling yet):

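
As a rough sketch (the class name NaiveTable is made up for this example, and only the sick-day count is stored, exactly as in the one-line rule above):

```python
TABLE_SIZE = 20  # double the 10 employees, as decided above


class NaiveTable:
    def __init__(self):
        self.slots = [None] * TABLE_SIZE

    def __setitem__(self, name, sick_days):
        # array[len(name) % 20] = sick_days
        # No collision handling: a name of the same length overwrites the slot.
        index = len(name) % TABLE_SIZE
        self.slots[index] = sick_days


table = NaiveTable()
table["Adam"] = 7
print(table.slots[4])  # 7
```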

For searching, we basically do the same:


array[length of the employee's name % 20]
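
Put together with the insertion rule, a sketch of the naive table (again with hypothetical names) shows both the lookup and why this is not enough yet:

```python
TABLE_SIZE = 20


class NaiveTable:
    def __init__(self):
        self.slots = [None] * TABLE_SIZE

    def _index(self, name):
        return len(name) % TABLE_SIZE

    def __setitem__(self, name, sick_days):
        self.slots[self._index(name)] = sick_days

    def __getitem__(self, name):
        value = self.slots[self._index(name)]
        if value is None:
            raise KeyError(name)
        return value


table = NaiveTable()
table["Adam"] = 7
table["Dana"] = 3     # same length as "Adam", so it lands in the same slot
print(table["Adam"])  # 3: Adam's record was silently overwritten
```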

We’re not done yet!


Python collision handling

Python uses a method called open addressing for handling collisions. It also resizes the hash table once it fills past a certain threshold, but we won’t discuss that aspect. The definition of open addressing from Wikipedia:

In another strategy, called open addressing, all entry records are stored in the bucket array itself. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table.

Let’s examine the process of retrieving a value by key, by looking at Python source code (written in C):


1. Calculate the hash of key

2. Calculate the index of the item by hash & mask, where mask = HASH_TABLE_SIZE - 1 (in simple terms, take the last N bits of the hash):

i = (size_t)hash & mask;

3. If the slot is empty, return DKIX_EMPTY, which eventually translates to a KeyError:

if (ix == DKIX_EMPTY) {
    *value_addr = NULL;
    return ix;
}

4. If the slot is not empty, compare keys and hashes, and if they are equal set value_addr to the address of the actual value:

if (ep->me_key == key) {
    *value_addr = ep->me_value;
    return ix;
}

and:


if (dk == mp->ma_keys && ep->me_key == startkey) {
    if (cmp > 0) {
        *value_addr = ep->me_value;
        return ix;
    }
}

5. If not equal, use different bits of the hash (algorithm explained here) and go to step 3 again:


perturb >>= PERTURB_SHIFT;
i = (i*5 + perturb + 1) & mask;
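
To get a feel for steps 2 and 5, here is a small Python sketch that replays the same arithmetic. The function name probe_sequence and its max_probes parameter are made up for this illustration; PERTURB_SHIFT is 5 in CPython's dictobject.c. The sketch assumes a non-negative hash and a table size that is a power of two:

```python
PERTURB_SHIFT = 5  # value defined in CPython's dictobject.c


def probe_sequence(hash_value, table_size, max_probes=8):
    """Yield the slot indices visited for hash_value.

    Assumes hash_value >= 0 and table_size is a power of two.
    """
    mask = table_size - 1
    perturb = hash_value
    i = hash_value & mask  # step 2: keep the last N bits of the hash
    for _ in range(max_probes):
        yield i
        # step 5: mix in higher bits of the hash and try another slot
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask


print(list(probe_sequence(8, 8, max_probes=4)))    # [0, 1, 6, 7]
print(list(probe_sequence(100, 8, max_probes=3)))  # [4, 0, 1]
```

Once perturb has been shifted down to zero, the recurrence i = (i*5 + 1) & mask alone keeps cycling through every slot of the table, so an empty slot is always found eventually.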

Here’s a diagram to illustrate the whole process:


The insertion process is pretty similar: if the found slot is empty, the entry is inserted there. If it’s not empty, we compare the key and the hash; if they are equal we replace the value, and if not we continue our quest of finding a new spot with the perturb algorithm.

Borrowing ideas from Python

We can borrow Python’s idea of comparing both the keys and the hashes of entries, and apply it to our entry object (replacing the previous method):
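
A sketch of what that comparison could look like, assuming the Entry class caches its hash as before (the exact shape of the class is an assumption of this example):

```python
TABLE_SIZE = 20


class Entry:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.hash = len(key) % TABLE_SIZE

    def __eq__(self, other):
        # Cheap hash comparison first, then the keys themselves.
        if not isinstance(other, Entry):
            return NotImplemented
        return self.hash == other.hash and self.key == other.key


print(Entry("Adam", 7) == Entry("Adam", 2))  # True: same key, the value is ignored
print(Entry("Adam", 7) == Entry("Dana", 7))  # False: same hash, different key
```

Comparing the hashes first is cheap, and the key comparison only runs when the hashes already match.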

Our hash table still does not have any collision handling, so let’s implement one! As we saw earlier, Python does it by comparing entries and then probing again with different bits of the hash, but we will do it by using a method called linear probing (which is a form of open addressing, explained above):

When the hash function causes a collision by mapping a new key to a cell of the hash table that is already occupied by another key, linear probing searches the table for the closest following free location and inserts the new key there.

So what we’re going to do is move forward until we find an open space. If you recall, we implemented our table with double the size (20 elements and not 10). This is where it comes in handy. When we move forward, our search for an open space will be much quicker because there’s more room!

But we have a problem. What if someone evil tries to insert the 11th element? We need to raise an error (we won’t be dealing with table resizing in this article). We can keep a counter of filled entries in our table:

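
Here is one way this could be sketched. The class name SickDaysTable, the count attribute and the MAX_ENTRIES limit of 10 are assumptions of this example, and slots hold plain (name, sick_days) tuples to keep it short:

```python
TABLE_SIZE = 20   # number of slots
MAX_ENTRIES = 10  # assumed limit: one record per employee


class SickDaysTable:
    def __init__(self):
        self.slots = [None] * TABLE_SIZE
        self.count = 0  # number of filled slots

    def __setitem__(self, name, sick_days):
        index = len(name) % TABLE_SIZE
        # Linear probing: step forward (wrapping around) until we find
        # an empty slot or the slot that already holds this name.
        while self.slots[index] is not None and self.slots[index][0] != name:
            index = (index + 1) % TABLE_SIZE
        if self.slots[index] is None:
            if self.count >= MAX_ENTRIES:
                raise ValueError("table is full")
            self.count += 1
        self.slots[index] = (name, sick_days)


table = SickDaysTable()
table["Adam"] = 7
table["Dana"] = 3  # collides with "Adam" at index 4, so it moves on to index 5
print(table.slots[4], table.slots[5])  # ('Adam', 7) ('Dana', 3)
```

Because at most 10 of the 20 slots are ever filled, the probing loop is guaranteed to reach an empty slot.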

Now let’s implement the same in our searching method:

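
A sketch of the lookup side, using the same probing walk for both operations. The helper name _find is hypothetical, and the entry counter is left out here to keep the example focused on searching:

```python
TABLE_SIZE = 20


class SickDaysTable:
    def __init__(self):
        self.slots = [None] * TABLE_SIZE

    def _find(self, name):
        """Return the index holding name, or the first empty slot on its probe path."""
        index = len(name) % TABLE_SIZE
        for _ in range(TABLE_SIZE):
            slot = self.slots[index]
            if slot is None or slot[0] == name:
                return index
            index = (index + 1) % TABLE_SIZE
        return None  # every slot is taken by other names

    def __setitem__(self, name, sick_days):
        index = self._find(name)
        if index is None:
            raise ValueError("table is full")
        self.slots[index] = (name, sick_days)

    def __getitem__(self, name):
        index = self._find(name)
        if index is None or self.slots[index] is None:
            raise KeyError(name)
        return self.slots[index][1]


table = SickDaysTable()
table["Adam"] = 7
table["Dana"] = 3
print(table["Adam"], table["Dana"])  # 7 3
```

Reaching an empty slot during a search means the name was never inserted, which is exactly the rule from the open addressing definition quoted earlier.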

The full code can be found here.


Now the company can safely store sick days for each employee:


Python Set

Going back to the beginning of the article, set and dict in Python are implemented very similarly, with set using only key and hash inside each record, as can be seen in the source code:


typedef struct {
    PyObject *key;
    Py_hash_t hash; /* Cached hash code of the key */
} setentry;

As opposed to dict, which also holds a value:

typedef struct {
    /* Cached hash code of me_key. */
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* This field is only meaningful for combined tables */
} PyDictKeyEntry;

Performance and Order

Time comparison

I think it’s now clear that a dict is much, much faster than a list (and takes way more memory space) in terms of searching, inserting (at a specific place) and deleting. Let's validate that assumption with some code (I am running the code on a 2017 MacBook Pro):

And the following is the test code (once for the dict and once for the list, replacing d):

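
The original test code is not reproduced here, so the following is only a sketch of a comparable measurement. It covers searching only, the sizes N and LOOKUPS are arbitrary choices for this example, and the timings it prints will differ from the figures quoted in this article:

```python
import timeit

N = 10_000     # assumed collection size
LOOKUPS = 200  # assumed number of membership tests

as_list = list(range(N))
as_dict = dict.fromkeys(as_list)


def time_lookups(container):
    # Search for the last element, the worst case for a list scan.
    return timeit.timeit(lambda: (N - 1) in container, number=LOOKUPS)


dict_seconds = time_lookups(as_dict)
list_seconds = time_lookups(as_list)
print(f"dict: {dict_seconds:.6f} seconds")
print(f"list: {list_seconds:.6f} seconds")
```

The list has to compare elements one by one until it reaches the target, while the dict jumps (almost) straight to the right slot.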

The results are, well, pretty much what we expected.

dict: 0.015382766723632812 seconds


list: 55.5544171333313 seconds


Order depends on insertion order

The order of the dict depends on the history of insertion. If we insert an entry with a specific hash, and afterwards an entry with the same hash, the second entry is going to end up in a different place than if we were to insert it first.
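
We can see this effect with our own toy hash function and linear probing. The layout helper below is made up for this illustration (it is not how CPython stores its entries), and it assumes fewer names than slots:

```python
def layout(names, table_size=8):
    """Place names with linear probing and return the resulting slot list.

    Assumes len(names) < table_size, so an empty slot always exists.
    """
    slots = [None] * table_size
    for name in names:
        index = len(name) % table_size
        while slots[index] is not None:
            index = (index + 1) % table_size
        slots[index] = name
    return slots


print(layout(["Adam", "Dana"]))  # [None, None, None, None, 'Adam', 'Dana', None, None]
print(layout(["Dana", "Adam"]))  # [None, None, None, None, 'Dana', 'Adam', None, None]
```

Both names hash to slot 4, so whichever one is inserted first takes it and the other is pushed to slot 5.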